Abstract

In this paper we address the problem of decision making within a Markov decision
process (MDP) framework where risk and modeling errors are taken into account. Our
approach is to minimize a risk-sensitive conditional-value-at-risk (CVaR) objective,
as opposed to a standard risk-neutral expectation. We refer to such a problem as a
CVaR MDP. Our first contribution is to show that a CVaR objective, besides capturing
risk sensitivity, has an alternative interpretation as expected cost under worst-case
modeling errors, for a given error budget. This result, which is of independent
interest, motivates CVaR MDPs as a unifying framework for risk-sensitive and robust
decision making. Our second contribution is to present an approximate value-iteration
algorithm for CVaR MDPs and analyze its convergence rate. To our knowledge, this is
the first solution algorithm for CVaR MDPs that enjoys error guarantees. Finally, we
present results from numerical experiments that corroborate our theoretical findings
and show the practicality of our approach.


1 Introduction 

Decision making within the Markov decision process (MDP) framework typically involves
the minimization of a risk-neutral performance objective, namely the expected total
discounted cost [?]. This approach, while very popular, natural, and attractive from a
computational standpoint, neither takes into account the variability of the cost
(i.e., fluctuations around the mean), nor its sensitivity to modeling errors, which may
significantly affect overall performance [?]. Risk-sensitive MDPs [?] address the first
aspect by replacing the risk-neutral expectation with a risk-measure of the total
discounted cost, such as variance, Value-at-Risk (VaR), or Conditional-VaR (CVaR).
Robust MDPs [?], on the other hand, address the second aspect by defining a set of
plausible MDP parameters and optimizing decisions with respect to the worst-case
scenario.

In this work we consider risk-sensitive MDPs with a CVaR objective, referred to as
CVaR MDPs. CVaR [?] is a risk-measure that is rapidly gaining popularity in various
engineering applications, e.g., finance, due to its favorable computational properties
[?] and superior ability to safeguard a decision maker from the "outcomes that hurt the
most" [21]. In this paper, by relating risk to robustness, we derive a novel result
that further motivates the usage of a CVaR objective in a decision-making context.
Specifically, we show that the CVaR of a discounted cost in an MDP is equivalent to the
expected value of the same discounted cost in presence of worst-case perturbations of
the MDP parameters (specifically, transition probabilities), provided that such
perturbations are within a certain error budget. This result suggests CVaR MDP as a
method for decision making under both cost variability and model uncertainty,
motivating it as a unified framework for planning under uncertainty.

Literature review. Risk-sensitive MDPs have been studied for over four decades, with
earlier efforts focusing on exponential utility [?], mean-variance [23], and percentile
risk criteria [?]. Recently, for the reasons explained above, several authors have
investigated CVaR MDPs [?]. Specifically, in [4], the authors propose a dynamic
programming algorithm for finite-horizon risk-constrained MDPs where risk is measured
according to CVaR. The algorithm is proven to asymptotically converge to an optimal
risk-constrained policy. However, the algorithm involves computing integrals over
continuous variables (Algorithm 1 in [4]) and, in general, its implementation appears
particularly difficult. In [2], the authors investigate the structure of CVaR optimal
policies and show that a Markov policy is optimal on an augmented state space, where
the additional (continuous) state variable is represented by the running cost. In [8],
the authors leverage this result to design an algorithm for CVaR MDPs that relies on
discretizing occupation measures in the augmented-state MDP. This approach, however,
involves solving a non-convex program via a sequence of linear-programming
approximations, which can only be shown to converge asymptotically. A different
approach is taken by [5] and [24], which consider a finite dimensional parameterization
of control policies, and show that a CVaR MDP can be optimized to a local optimum using
stochastic gradient descent (policy gradient). A recent result by Pflug and Pichler [?]
showed that CVaR MDPs admit a dynamic programming formulation by using a
state-augmentation procedure different from the one in [2]. The augmented state is also
continuous, making the design of a solution algorithm challenging.

Contributions: The contribution of this paper is twofold. First, as discussed above, we
provide a novel interpretation for CVaR MDPs in terms of robustness to modeling errors.
This result is of independent interest and further motivates the usage of CVaR MDPs for
decision making under uncertainty. Second, we provide a new optimization algorithm for
CVaR MDPs, which leverages the state augmentation procedure introduced by Pflug and
Pichler [?]. We overcome the aforementioned computational challenges (due to the
continuous augmented state) by designing an algorithm that merges approximate value
iteration [?] with linear interpolation. Remarkably, we are able to provide explicit
error bounds and convergence rates based on contraction-style arguments. In comparison
to the algorithms in [4, 8, 5, 24], our approach leads to finite-time error guarantees
with respect to the globally optimal policy. In addition, our algorithm is
significantly simpler than previous methods, and calculates the optimal policy for all
CVaR confidence levels and initial states simultaneously. The practicality of our
approach is demonstrated in numerical experiments involving planning a path on a grid
with thousands of states. To the best of our knowledge, this is the first algorithm to
compute globally-optimal policies for non-trivial CVaR MDPs.

Organization: This paper is structured as follows. In Section 2 we provide background
on CVaR and MDPs, we state the problem we wish to solve (i.e., CVaR MDPs), and motivate
the CVaR MDP formulation by establishing a novel relation between CVaR and model
perturbations. Section 3 provides the basis for our solution algorithm, based on a
Bellman-style equation for the CVaR. Then, in Section 4 we present our algorithm and
correctness analysis, and in Section 5 we evaluate our approach via numerical
experiments. Finally, in Section 6 we draw some conclusions and discuss directions for
future work.


2 Preliminaries, Problem Formulation, and Motivation 

2.1 Conditional Value-at-Risk 

Let $Z$ be a bounded-mean random variable, i.e., $E[|Z|] < \infty$, on a probability
space $(\Omega, \mathcal{F}, P)$, with cumulative distribution function
$F(z) = P(Z \le z)$. In this paper we interpret $Z$ as a cost. The value-at-risk (VaR)
at confidence level $\alpha \in (0,1)$ is the $1-\alpha$ quantile of $Z$, i.e.,
$\mathrm{VaR}_\alpha(Z) = \min\{z \mid F(z) \ge 1-\alpha\}$. The conditional
value-at-risk (CVaR) at confidence level $\alpha \in (0,1)$ is defined as [?]:

$$\mathrm{CVaR}_\alpha(Z) = \min_{w \in \mathbb{R}} \Big\{ w + \frac{1}{\alpha} E\big[(Z - w)^+\big] \Big\}, \qquad (1)$$


where $(x)^+ = \max(x, 0)$ represents the positive part of $x$. If there is no
probability atom at $\mathrm{VaR}_\alpha(Z)$, it is well known that
$\mathrm{CVaR}_\alpha(Z) = E[Z \mid Z \ge \mathrm{VaR}_\alpha(Z)]$. Therefore,
$\mathrm{CVaR}_\alpha(Z)$ may be interpreted as the worst-case expected value of $Z$,
conditioned on the $\alpha$-portion of the tail distribution. It is well known that
$\mathrm{CVaR}_\alpha(Z)$ is decreasing in $\alpha$, that $\mathrm{CVaR}_1(Z)$ equals
$E(Z)$, and that $\mathrm{CVaR}_\alpha(Z)$ tends to $\max(Z)$ as $\alpha \downarrow 0$.
During the last decade, the CVaR risk-measure has gained popularity in financial
applications, among others. It is especially useful for controlling rare but
potentially disastrous events, which occur above the $1-\alpha$ quantile and are
neglected by the VaR [21]. Furthermore, CVaR enjoys desirable axiomatic properties,
such as coherence [?]. We refer to [25] for further motivation about CVaR and a
comparison with other risk measures such as VaR.
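As a small illustration of Eq. (1), the following sketch evaluates the CVaR of a discrete cost distribution. It assumes a finite support, in which case the objective in (1) is convex and piecewise linear in $w$ and a minimizer can be found among the support points. The function name `cvar_primal` and the example distribution are hypothetical and are not taken from the paper.

```python
def cvar_primal(values, probs, alpha):
    """CVaR_alpha of a discrete cost Z via Eq. (1): min_w { w + E[(Z - w)^+] / alpha }."""
    # For a finite support the objective is convex and piecewise linear in w,
    # with kinks at the support points, so it suffices to test those points.
    def objective(w):
        tail = sum(p * max(z - w, 0.0) for z, p in zip(values, probs))
        return w + tail / alpha

    return min(objective(w) for w in values)


# Hypothetical example: a cost that is usually small and occasionally large.
costs = [0.0, 1.0, 10.0]
probs = [0.5, 0.4, 0.1]

mean_cost = cvar_primal(costs, probs, 1.0)   # alpha = 1 recovers E[Z]
half_tail = cvar_primal(costs, probs, 0.5)   # average over the worst 50% of outcomes
rare_tail = cvar_primal(costs, probs, 0.1)   # average over the worst 10% of outcomes
print(mean_cost, half_tail, rare_tail)
```

On this example the three values are approximately 1.4, 2.8 and 10, which is consistent with CVaR being decreasing in $\alpha$ and equal to the mean at $\alpha = 1$.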

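A useful property of CVaR, which we exploit in this paper, is its alternative dual
representation [?]:

$$\mathrm{CVaR}_\alpha(Z) = \max_{\xi \in \mathcal{U}_{\mathrm{CVaR}}(\alpha, P)} E_\xi[Z], \qquad (2)$$

where $E_\xi[Z]$ denotes the $\xi$-weighted expectation of $Z$, and the risk envelope
$\mathcal{U}_{\mathrm{CVaR}}$ is given by

$$\mathcal{U}_{\mathrm{CVaR}}(\alpha, P) = \Big\{ \xi : \xi(\omega) \in \Big[0, \frac{1}{\alpha}\Big],\ \int_{\omega \in \Omega} \xi(\omega) P(\omega)\, d\omega = 1 \Big\}.$$

Thus, the CVaR of a random variable $Z$ may be interpreted as the worst-case
expectation of $Z$, under a perturbed distribution $\xi P$.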
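To make the dual representation (2) concrete, the sketch below solves the maximization over the risk envelope for a discrete distribution. For a finite support the problem is a small linear program whose solution is greedy: the largest costs receive the maximal weight $1/\alpha$ until the perturbed probabilities sum to one. The function name `cvar_dual` and the example numbers are hypothetical.

```python
def cvar_dual(values, probs, alpha):
    """CVaR via Eq. (2): maximize E[xi * Z] subject to 0 <= xi <= 1/alpha and E[xi] = 1."""
    remaining = 1.0                  # perturbed probability mass still to be assigned
    total = 0.0
    xi = [0.0] * len(values)
    # Visit outcomes from the largest cost to the smallest one.
    for i in sorted(range(len(values)), key=lambda j: values[j], reverse=True):
        if probs[i] <= 0.0:
            continue
        mass = min(probs[i] / alpha, remaining)   # equals xi_i * p_i, capped by the budget
        xi[i] = mass / probs[i]
        total += mass * values[i]
        remaining = max(remaining - mass, 0.0)
    return total, xi


# Hypothetical example distribution.
costs = [0.0, 1.0, 10.0]
probs = [0.5, 0.4, 0.1]

value, xi = cvar_dual(costs, probs, 0.5)
print(value, xi)   # the two worst outcomes are weighted by 1/alpha = 2, the best one by 0
```

The weights `xi` returned here describe the perturbed distribution $\xi P$ under which the plain expectation of the cost equals the CVaR.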

In this paper, we are interested in the CVaR of the total discounted cost in a
sequential decision-making setting, as discussed next.
2.2 Markov Decision Processes


An MDP is a tuple $\mathcal{M} = (\mathcal{X}, \mathcal{A}, C, P, x_0, \gamma)$, where
$\mathcal{X}$ and $\mathcal{A}$ are finite state and action spaces;
$C(x, a) \in [-C_{\max}, C_{\max}]$ is a bounded deterministic cost; $P(\cdot \mid x, a)$
is the transition probability distribution; $\gamma \in [0, 1)$ is the discounting
factor, and $x_0$ is the initial state. (Our results easily generalize to random initial
states and random costs.)

Let the space of admissible histories up to time $t$ be
$H_t = H_{t-1} \times \mathcal{A} \times \mathcal{X}$, for $t \ge 1$, and
$H_0 = \mathcal{X}$. A generic element $h_t \in H_t$ is of the form
$h_t = (x_0, a_0, \ldots, x_{t-1}, a_{t-1}, x_t)$. Let $\Pi_{H,t}$ be the set of all
deterministic history-dependent policies with the property that at each time $t$ the
control is a function of $h_t$. In other words,
$\Pi_{H,t} := \{\mu_0 : H_0 \to \mathcal{A},\ \mu_1 : H_1 \to \mathcal{A},\ \ldots,\ \mu_t : H_t \to \mathcal{A} \mid \mu_j(h_j) \in \mathcal{A} \text{ for all } h_j \in H_j,\ 1 \le j \le t\}$.
We also let $\Pi_H = \lim_{t \to \infty} \Pi_{H,t}$ be the set of all history-dependent
policies.

2.3 Problem Formulation 

Let the sequence of random variables $Z_t$ denote the stage-wise costs observed along a
state/control trajectory in the MDP model, and let
$C_{0,T} = \sum_{t=0}^{T} \gamma^t Z_t$ denote the total discounted cost up to time $T$.
The risk-sensitive discounted-cost problem we wish to address is as follows:

$$\min_{\mu \in \Pi_H} \mathrm{CVaR}_\alpha \Big( \lim_{T \to \infty} C_{0,T} \,\Big|\, x_0, \mu \Big), \qquad (3)$$

where $\mu = \{\mu_0, \mu_1, \ldots\}$ is the policy sequence with actions
$a_t = \mu_t(h_t)$ for $t \in \{0, 1, \ldots\}$. We refer to problem (3) as CVaR MDP.
(One may also consider a related formulation combining mean and CVaR, the details of
which are presented in the supplementary material.)

The problem formulation in (3) directly addresses the aspect of risk sensitivity, as
demonstrated by the numerous applications of CVaR optimization in finance (see, e.g.,
[20], [?], [?]) and the recent approaches for CVaR optimization in MDPs [4, 8, 5, 24].
In the following, we show a new result providing additional motivation for CVaR MDPs,
from the point of view of robustness to modeling errors.

2.4 Motivation - Robustness to Modeling Errors 

We show a new result relating the CVaR objective in (3) to the expected discounted cost
in presence of worst-case perturbations of the MDP parameters, where the perturbations
are budgeted according to the "number of things that can go wrong." Thus, by minimizing
CVaR, the decision maker also guarantees robustness of the policy.

Consider a trajectory $(x_0, \ldots, x_T)$ in a finite-horizon MDP problem with
transitions $P_t(x_t \mid x_{t-1})$. We explicitly denote the time index of the
transition matrices for reasons that will become clear shortly. The total probability
of the trajectory is
$P(x_0, \ldots, x_T) = P_0(x_0) P_1(x_1 \mid x_0) \cdots P_T(x_T \mid x_{T-1})$, and we
let $C_{0,T}(x_1, \ldots, x_T)$ denote its discounted cost, as defined above.

We consider an adversarial setting, where an adversary is allowed to change the
transition probabilities at each stage, under some budget constraints. We will show
that, for a specific budget and perturbation structure, the expected cost under the
worst-case perturbation is equivalent to the CVaR of the cost. Thus, we shall establish
that, in this perspective, being risk sensitive is equivalent to being robust against
model perturbations.

For each stage $1 \le t \le T$, consider a perturbed transition matrix
$\hat{P}_t = P_t \circ \delta_t$, where
$\delta_t \in \mathbb{R}^{\mathcal{X} \times \mathcal{X}}$ is a multiplicative
probability perturbation and $\circ$ is the Hadamard product, under the condition that
$\hat{P}_t$ is a stochastic matrix. Let $\Delta_t$ denote the set of perturbation
matrices that satisfy this condition, and let
$\Delta = \Delta_1 \times \cdots \times \Delta_T$ be the set of all possible
perturbations to the trajectory distribution.

We now impose a budget constraint on the perturbations as follows. For some budget
$\eta \ge 1$, we consider the constraint

$$\delta_1(x_1 \mid x_0)\, \delta_2(x_2 \mid x_1) \cdots \delta_T(x_T \mid x_{T-1}) \le \eta, \quad \forall x_0, \ldots, x_T \in \mathcal{X}. \qquad (4)$$

Essentially, the product in Eq. (4) states that the worst cannot happen at each time.
Instead, the perturbation budget has to be split (multiplicatively) along the
trajectory. We note that Eq. (4) is in fact a constraint on the perturbation matrices,
and we denote by $\Delta_\eta \subseteq \Delta$ the set of perturbations that satisfy
this constraint with budget $\eta$. The following result shows an equivalence between
the CVaR and the worst-case expected loss.
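The following sketch illustrates the two conditions on a perturbation for a toy two-state chain: each perturbed matrix $P_t \circ \delta_t$ must remain stochastic, and the product of perturbations along every trajectory must respect the budget in Eq. (4). The chain, the perturbation matrices, and the helper names are hypothetical; the budget check enumerates all trajectories and is only meant for very small examples.

```python
from itertools import product


def hadamard(P, delta):
    """Entrywise (Hadamard) product of a transition matrix and a perturbation."""
    return [[p * d for p, d in zip(p_row, d_row)] for p_row, d_row in zip(P, delta)]


def is_stochastic(M, tol=1e-9):
    """Rows must be non-negative and sum to one."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol for row in M)


def max_trajectory_perturbation(deltas):
    """Largest product delta_1(x_1|x_0) * ... * delta_T(x_T|x_{T-1}) over all trajectories."""
    n = len(deltas[0])
    worst = 0.0
    for traj in product(range(n), repeat=len(deltas) + 1):   # (x_0, ..., x_T)
        prod = 1.0
        for t, d in enumerate(deltas):
            prod *= d[traj[t]][traj[t + 1]]
        worst = max(worst, prod)
    return worst


# Hypothetical nominal chain, used at both stages (T = 2).
P = [[0.9, 0.1], [0.5, 0.5]]
# Mild perturbation: doubles the probability of the move 0 -> 1 only.
delta_mild = [[0.8 / 0.9, 2.0], [1.0, 1.0]]
# Aggressive perturbation: also doubles the probability of staying in state 1.
delta_hard = [[0.8 / 0.9, 2.0], [0.0, 2.0]]

eta = 2.0
print(is_stochastic(hadamard(P, delta_mild)), is_stochastic(hadamard(P, delta_hard)))
print(max_trajectory_perturbation([delta_mild, delta_mild]) <= eta)   # within budget
print(max_trajectory_perturbation([delta_mild, delta_hard]) <= eta)   # exceeds budget
```

Both perturbations keep every stage stochastic, but applying a factor of 2 at two consecutive stages along the trajectory 0, 1, 1 yields a product of 4, which violates the budget $\eta = 2$. This is the sense in which the budget has to be split along the trajectory.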

Proposition 1 (Interpretation of CVaR as a Robustness Measure) It holds that

$$\mathrm{CVaR}_{1/\eta}\big(C_{0,T}(x_1, \ldots, x_T)\big) = \sup_{(\delta_1, \ldots, \delta_T) \in \Delta_\eta} E_{\hat{P}}\big[C_{0,T}(x_1, \ldots, x_T)\big], \qquad (5)$$

where $E_{\hat{P}}[\cdot]$ denotes expectation with respect to a Markov chain with
transitions $\hat{P}_t$.

The proof of Proposition 1 is in the supplementary material. It is instructive to
compare Proposition 1 with the dual representation of CVaR in (2). Note, in particular,
that the perturbation budget in Proposition 1 has a temporal structure, which
constrains the adversary from choosing the worst perturbation at each time step.

Remark 1 An equivalence between robustness and risk-sensitivity was previously
suggested by Osogami [?]. In that study, the iterated (dynamic) coherent risk was shown
to be equivalent to a robust MDP [10] with a rectangular uncertainty set. The iterated
risk (and, correspondingly, the rectangular uncertainty set) is very conservative in
the sense that the worst can happen at each time step. In contrast, the perturbations
considered here are much less conservative. In general, solving robust MDPs without the
rectangularity assumption is NP-hard. Nevertheless, Mannor et al. [?] showed that, for
cases where the number of perturbations to the parameters along a trajectory is upper
bounded (budget-constrained perturbation), the corresponding robust MDP problem is
tractable. Analogous to the constraint set (1) in [?], the perturbation set in
Proposition 1 limits the total number of log-perturbations along a trajectory.
Accordingly, we shall later see that optimizing problem (3) with perturbation structure
(4) is indeed also tractable.

The next section provides the fundamental theoretical ideas behind our approach to the
solution of (3).
3 Bellman Equation for CVaR

In this section, by leveraging a recent result from [?], we present a dynamic
programming (DP) formulation for the CVaR MDP problem in (3). As we shall see, the
value function in this formulation depends on both the state and the CVaR confidence
level $\alpha$. We then establish important properties of this DP formulation, which
will later enable us to derive an efficient DP-based approximate solution algorithm and
provide correctness guarantees on the approximation error. All proofs are presented in
the supplementary material.

Our starting point is a recursive decomposition of CVaR, whose proof is detailed in
Theorem 10 of [?].

Theorem 2 (CVaR Decomposition Theorem, [?]) For any $t \ge 0$, denote by
$Z = (Z_{t+1}, Z_{t+2}, \ldots)$ the cost sequence from time $t+1$ onwards. The
conditional CVaR under policy $\mu$, i.e., $\mathrm{CVaR}_\alpha(Z \mid h_t, \mu)$,
obeys the following decomposition:

$$\mathrm{CVaR}_\alpha(Z \mid h_t, \mu) = \max_{\xi \in \mathcal{U}_{\mathrm{CVaR}}(\alpha, P(\cdot \mid x_t, a_t))} E\big[\xi(x_{t+1}) \cdot \mathrm{CVaR}_{\alpha \xi(x_{t+1})}(Z \mid h_{t+1}, \mu) \,\big|\, h_t, \mu\big],$$

where $a_t$ is the action induced by policy $\mu_t(h_t)$, and the expectation is with
respect to $x_{t+1}$.

Theorem 2 concerns a fixed policy $\mu$; we now extend it to a general DP formulation.
Note that in the recursive decomposition in Theorem 2 the right-hand side involves CVaR
terms with different confidence levels than that in the left-hand side. Accordingly, we
augment the state space $\mathcal{X}$ with an additional continuous state
$\mathcal{Y} = (0, 1]$, which corresponds to the confidence level. For any
$x \in \mathcal{X}$ and $y \in \mathcal{Y}$, the value function $V(x, y)$ for the
augmented state $(x, y)$ is defined as:

$$V(x, y) = \min_{\mu \in \Pi_H} \mathrm{CVaR}_y \Big( \lim_{T \to \infty} C_{0,T} \,\Big|\, x_0 = x, \mu \Big).$$

Similar to standard DP, it is convenient to work with operators defined on the space of
value functions [?]. In our case, Theorem 2 leads to the following definition of the
CVaR Bellman operator
$\mathbf{T} : \mathcal{X} \times \mathcal{Y} \to \mathcal{X} \times \mathcal{Y}$:


$$\mathbf{T}[V](x, y) = \min_{a \in \mathcal{A}} \Big[ C(x, a) + \gamma \max_{\xi \in \mathcal{U}_{\mathrm{CVaR}}(y, P(\cdot \mid x, a))} \sum_{x' \in \mathcal{X}} \xi(x')\, V\big(x', y\xi(x')\big)\, P(x' \mid x, a) \Big]. \qquad (6)$$


We now establish several useful properties for the Bellman operator T[V]. 


Lemma 3 (Properties of CVaR Bellman Operator) The Bellman operator $\mathbf{T}[V]$ has
the following properties:

1. (Contraction.) $\|\mathbf{T}[V_1] - \mathbf{T}[V_2]\|_\infty \le \gamma \|V_1 - V_2\|_\infty$,
   where $\|f\|_\infty = \sup_{x \in \mathcal{X}, y \in \mathcal{Y}} |f(x, y)|$.

2. (Concavity preserving in y.) For any $x \in \mathcal{X}$, suppose $yV(x, y)$ is
   concave in $y \in \mathcal{Y}$. Then the maximization problem in (6) is concave.
   Furthermore, $y\mathbf{T}[V](x, y)$ is concave in $y$.

The first property in Lemma 3 is similar to standard DP [?], and is instrumental to the
design of a converging value-iteration approach. The second property is nonstandard and
specific to our approach. It will be used to show that the computation of
value-iteration updates involves concave, and therefore tractable, optimization
problems. Furthermore, it will be used to show that a linear interpolation of
$V(x, y)$ in the augmented state $y$ has a bounded error.

Equipped with the results in Theorem 2 and Lemma 3, we can now show that the fixed
point solution of $\mathbf{T}[V](x, y) = V(x, y)$ is unique, and equals the solution of
the CVaR MDP problem (3) with $x_0 = x$ and $\alpha = y$.

Theorem 4 (Optimality Condition) For any $x \in \mathcal{X}$ and $y \in (0, 1]$, the
solution to $\mathbf{T}[V](x, y) = V(x, y)$ is unique, and equals
$V^*(x, y) = \min_{\mu \in \Pi_H} \mathrm{CVaR}_y\big(\lim_{T \to \infty} C_{0,T} \mid x_0 = x, \mu\big)$.

Next, we show that the optimal value of the CVaR MDP problem (3) can be attained
by a stationary Markov policy, defined as a greedy policy with respect to the value 
function V*{x,y). Thus, while the original problem is defined over the intractable 
space of history-dependent policies, a stationary Markov policy (over the augmented 
state space) is optimal, and can be readily derived from V*{x,y). Furthermore, an 
optimal history-dependent policy can be readily obtained from an (augmented) optimal 
Markov policy according to the following theorem. 

Theorem 5 (Optimal Policies) Let $\mu^* = \{\mu_0, \mu_1, \ldots\} \in \Pi_H$ be a
history-dependent policy recursively defined as:

$$\mu_k(h_k) = u^*(x_k, y_k), \quad \forall k \ge 0, \qquad (7)$$

with initial conditions $x_0$ and $y_0 = \alpha$, and state transitions

$$x_k \sim P\big(\cdot \mid x_{k-1}, u^*(x_{k-1}, y_{k-1})\big), \quad y_k = y_{k-1}\, \xi^*_{x_{k-1}, y_{k-1}, u^*}(x_k), \quad \forall k \ge 1, \qquad (8)$$

where the stationary Markovian policy $u^*(x, y)$ and risk factor
$\xi^*_{x, y, u^*}(\cdot)$ are the solution to the min-max optimization problem in the
CVaR Bellman operator $\mathbf{T}[V^*](x, y)$. Then, $\mu^*$ is an optimal policy for
problem (3) with initial state $x_0$ and CVaR confidence level $\alpha$.

Theorems 4 and 5 suggest that a value-iteration DP method [?] can be used to solve the
CVaR MDP problem (3). Let an initial value-function guess
$V_0 : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be chosen arbitrarily. Value
iteration proceeds recursively as follows:

$$V_{k+1}(x, y) = \mathbf{T}[V_k](x, y), \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y},\ k \in \{0, 1, \ldots\}. \qquad (9)$$

Specifically, by combining the contraction property in Lemma 3 and the uniqueness
result for fixed point solutions from Theorem 4, one concludes that
$\lim_{k \to \infty} V_k(x, y) = V^*(x, y)$. By selecting $x = x_0$ and $y = \alpha$,
one immediately obtains
$V^*(x_0, \alpha) = \min_{\mu \in \Pi_H} \mathrm{CVaR}_\alpha\big(\lim_{T \to \infty} C_{0,T} \mid x_0, \mu\big)$.
Furthermore, an optimal policy may be derived from $V^*(x, y)$ according to the policy
construction procedure in Theorem 5.

Unfortunately, while value iteration is conceptually appealing, its direct
implementation in our setting is generally impractical since, e.g., the state $y$ is
continuous. In the following, we pursue an approximation to the value iteration
algorithm (9), based on a linear interpolation scheme for $y$.

Algorithm 1 CVaR Value Iteration with Linear Interpolation

1: Given:

   • $N(x)$ interpolation points $Y(x) = \{y_1, \ldots, y_{N(x)}\} \in [0, 1]^{N(x)}$ for
     every $x \in \mathcal{X}$, with $y_i < y_{i+1}$, $y_1 = 0$ and $y_{N(x)} = 1$.

   • Initial value function $V_0(x, y)$ that satisfies Assumption 1.

2: For $t = 1, 2, \ldots$

   • For each $x \in \mathcal{X}$ and each $y_i \in Y(x)$, update the value function
     estimate as follows:

     $V_t(x, y_i) = \mathbf{T}_\mathcal{I}[V_{t-1}](x, y_i)$.

3: Set the converged value iteration estimate as $\hat{V}^*(x, y_i)$, for any
   $x \in \mathcal{X}$ and $y_i \in Y(x)$.


4 Value Iteration with Linear Interpolation 

In this section we present an approximate DP algorithm for solving CVaR MDPs, based on
the theoretical results of Section 3. The value iteration algorithm in Eq. (9) presents
two main implementation challenges. The first is due to the fact that the augmented
state $y$ is continuous. We handle this challenge by using interpolation, and exploit
the concavity of $yV(x, y)$ to bound the error introduced by this procedure. The second
challenge stems from the fact that applying $\mathbf{T}$ involves maximizing over
$\xi$. Our strategy is to exploit the concavity of the maximization problem to
guarantee that such optimization can indeed be performed effectively.

As discussed, our approach relies on the fact that the Bellman operator $\mathbf{T}$
preserves concavity as established in Lemma 3. Accordingly, we require the following
assumption for the initial guess $V_0(x, y)$.


Assumption 1 The guess for the initial value function $V_0(x, y)$ satisfies the
following properties: 1) $yV_0(x, y)$ is concave in $y \in \mathcal{Y}$ and 2)
$V_0(x, y)$ is continuous in $y \in \mathcal{Y}$ for any $x \in \mathcal{X}$.


Assumption 1 may easily be satisfied, for example, by choosing
$V_0(x, y) = \mathrm{CVaR}_y(Z \mid x_0 = x)$, where $Z$ is any arbitrary bounded
random variable. As stated earlier, a key difficulty in applying value iteration (9) is
that, for each state $x \in \mathcal{X}$, the Bellman operator has to be calculated for
each $y \in \mathcal{Y}$, and $\mathcal{Y}$ is continuous. As an approximation, we
propose to calculate the Bellman operator only for a finite set of values $y$, and
interpolate the value function in between such interpolation points.

Formally, let $N(x)$ denote the number of interpolation points. For every
$x \in \mathcal{X}$, denote by $Y(x) = \{y_1, \ldots, y_{N(x)}\} \in [0, 1]^{N(x)}$ the
set of interpolation points. We denote by $\mathcal{I}_x[V](y)$ the linear
interpolation of the function $yV(x, y)$ on these points, i.e.,

$$\mathcal{I}_x[V](y) = y_i V(x, y_i) + \frac{y_{i+1} V(x, y_{i+1}) - y_i V(x, y_i)}{y_{i+1} - y_i}\,(y - y_i),$$
where $y_i = \max\{y' \in Y(x) : y' \le y\}$ and $y_{i+1}$ is the next interpolation
point after $y_i$. The interpolation of $yV(x, y)$ instead of $V(x, y)$ is key to our
approach. The motivation is twofold: first, it can be shown [?] that for a discrete
random variable $Z$, $y\,\mathrm{CVaR}_y(Z)$ is piecewise linear in $y$. Second, one can
show that the Lipschitzness of $yV(x, y)$ is preserved during value iteration, and
exploit this fact to bound the linear interpolation error.
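A minimal sketch of the interpolation $\mathcal{I}_x[V](y)$ for a single state is given below. It assumes the interpolation points are stored in an increasing list starting at 0 and ending at 1; the function name `interp_yV` and the example values are hypothetical. The example uses a discrete cost whose CVaR is known in closed form, so that the piecewise linearity of $y\,\mathrm{CVaR}_y(Z)$ can be observed directly.

```python
from bisect import bisect_right


def interp_yV(grid, V_row, y):
    """Linear interpolation of y -> y * V(x, y) on the grid, for one fixed state x.

    grid  : increasing interpolation points with grid[0] == 0 and grid[-1] == 1
    V_row : V_row[i] is the value V(x, grid[i])
    """
    # Index of y_i = max{y' in grid : y' <= y}; clamp so that y = 1 uses the last segment.
    i = min(bisect_right(grid, y) - 1, len(grid) - 2)
    y_lo, y_hi = grid[i], grid[i + 1]
    f_lo, f_hi = y_lo * V_row[i], y_hi * V_row[i + 1]
    return f_lo + (f_hi - f_lo) / (y_hi - y_lo) * (y - y_lo)


# Hypothetical example: Z takes values 0, 1, 10 with probabilities 0.5, 0.4, 0.1.
# Its CVaR at y = 0, 0.1, 0.5, 1 is 10, 10, 2.8, 1.4, and y * CVaR_y(Z) has kinks
# exactly at the cumulative tail probabilities 0.1 and 0.5.
grid = [0.0, 0.1, 0.5, 1.0]
cvar_on_grid = [10.0, 10.0, 2.8, 1.4]

y = 0.3
approx_cvar = interp_yV(grid, cvar_on_grid, y) / y
print(approx_cvar)   # close to the exact CVaR_0.3(Z) = (0.1 * 10 + 0.2 * 1) / 0.3 = 4
```

Because the grid contains the kinks of $y\,\mathrm{CVaR}_y(Z)$, the interpolation is exact in this example; interpolating $V$ itself on the same grid would not be.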

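We now define the interpolated Bellman operator $\mathbf{T}_\mathcal{I}$ as follows:

$$\mathbf{T}_\mathcal{I}[V](x, y) = \min_{a \in \mathcal{A}} \Big[ C(x, a) + \gamma \max_{\xi \in \mathcal{U}_{\mathrm{CVaR}}(y, P(\cdot \mid x, a))} \sum_{x' \in \mathcal{X}} \frac{\mathcal{I}_{x'}[V]\big(y\xi(x')\big)}{y}\, P(x' \mid x, a) \Big]. \qquad (10)$$
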


Remark 2 Notice that by L'Hospital's rule one has
$\lim_{y \to 0} \mathcal{I}_x[V](y\xi(x))/y = V(x, 0)\xi(x)$. This implies that at
$y = 0$ the interpolated Bellman operator is equivalent to the original Bellman
operator, i.e.,
$\mathbf{T}[V](x, 0) = \min_{a \in \mathcal{A}} \big\{ C(x, a) + \gamma \max_{x' \in \mathcal{X} : P(x' \mid x, a) > 0} V(x', 0) \big\} = \mathbf{T}_\mathcal{I}[V](x, 0)$.


Algorithm 1 presents CVaR value iteration with linear interpolation. The only
difference between this algorithm and standard value iteration (9) is the linear
interpolation procedure described above. In the following, we show that Algorithm 1
converges, and bound the error due to interpolation. We begin by showing that the
useful properties established in Lemma 3 for the Bellman operator $\mathbf{T}$ extend
to the interpolated Bellman operator $\mathbf{T}_\mathcal{I}$.

Lemma 6 (Properties of Interpolated Bellman Operator) $\mathbf{T}_\mathcal{I}[V]$ has
the same properties as $\mathbf{T}[V]$ in Lemma 3, namely 1) contraction and 2)
concavity preservation.

Lemma 6 implies several important consequences for Algorithm 1. The first one is that
the maximization problem in (10) is concave, and thus may be solved efficiently at each
step. This guarantees that the algorithm is tractable. Second, the contraction property
in Lemma 6 guarantees that Algorithm 1 converges, i.e., there exists a value function
$\hat{V}^*$ such that
$\lim_{n \to \infty} \mathbf{T}_\mathcal{I}^n[V_0](x, y_i) = \hat{V}^*(x, y_i)$. In
addition, the convergence rate is geometric and equals $\gamma$.
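The sketch below puts the pieces of Algorithm 1 together on a toy MDP. It is an illustration under several assumptions and not the authors' implementation: the same grid is used for every state, the initial guess is $V_0 = 0$, and the inner maximization in (10) is solved by a greedy allocation instead of a linear program. The greedy step relies on the concavity of $yV(x, y)$ (Lemma 6): with $z_{x'} = y\xi(x')$, the problem becomes allocating a total probability mass $y$ over the linear segments of the maps $z \mapsto \mathcal{I}_{x'}[V](z)$, and the steepest segments are filled first. All function names and the example MDP are hypothetical.

```python
def interp_backup(V, grid, C, P, gamma, x, y):
    """Interpolated backup T_I[V](x, y); V[x'][i] stores V(x', grid[i])."""
    best = float("inf")
    for a in range(len(C[x])):
        p = P[x][a]
        succ = [xn for xn in range(len(p)) if p[xn] > 0.0]
        if y == 0.0:
            # Remark 2: at y = 0 the adversary selects the worst reachable successor.
            inner = max(V[xn][0] for xn in succ)
        else:
            # Every linear segment of z -> I_{x'}[V](z) contributes a pair
            # (slope, probability mass it can absorb).
            segments = []
            for xn in succ:
                for i in range(len(grid) - 1):
                    width = grid[i + 1] - grid[i]
                    slope = (grid[i + 1] * V[xn][i + 1] - grid[i] * V[xn][i]) / width
                    segments.append((slope, p[xn] * width))
            # Greedy allocation of total mass y, steepest segments first
            # (valid when each interpolated map is concave).
            segments.sort(key=lambda s: s[0], reverse=True)
            mass_left, gain = y, 0.0
            for slope, capacity in segments:
                mass = min(capacity, mass_left)
                gain += slope * mass
                mass_left -= mass
                if mass_left <= 0.0:
                    break
            inner = gain / y
        best = min(best, C[x][a] + gamma * inner)
    return best


def cvar_value_iteration(C, P, gamma, grid, iters):
    """Repeated synchronous application of the interpolated backup, starting from V_0 = 0."""
    V = [[0.0] * len(grid) for _ in range(len(P))]
    for _ in range(iters):
        V = [[interp_backup(V, grid, C, P, gamma, x, y) for y in grid]
             for x in range(len(P))]
    return V


# Toy MDP (hypothetical). State 0: risky action (cost 0, reaches the bad state 2 with
# probability 0.1) or safe action (cost 0.3, always reaches the good state 1).
# State 1 is absorbing with cost 0; state 2 is absorbing with cost 1 per step.
gamma = 0.5
grid = [0.0, 0.1, 0.5, 1.0]
C = [[0.0, 0.3], [0.0], [1.0]]
P = [[[0.0, 0.9, 0.1], [0.0, 1.0, 0.0]],
     [[0.0, 1.0, 0.0]],
     [[0.0, 0.0, 1.0]]]

V = cvar_value_iteration(C, P, gamma, grid, iters=60)
print([round(v, 3) for v in V[0]])   # approximately [0.3, 0.3, 0.2, 0.1]
```

In this example the risky action has CVaR $\min(0.1/y, 1)$ and the safe action has cost 0.3 at every confidence level, so the backup selects the risky action for $y \in \{0.5, 1\}$ and the safe one for $y \in \{0, 0.1\}$. For larger problems a linear-programming solver, as used in the experiments of Section 5, is the more general choice for the inner maximization.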

The following theorem provides an error bound between approximate value iteration and
exact value iteration (9) in terms of the interpolation resolution.

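Theorem 7 (Convergence and Error Bound) Suppose the initial value function
$V_0(x, y)$ satisfies Assumption 1 and let $\epsilon > 0$ be an error tolerance
parameter. For any state $x \in \mathcal{X}$ and step $t \ge 0$, choose $y_2 > 0$ such
that $V_t(x, y_2) - V_t(x, 0) \ge -\epsilon$ and update the interpolation points
according to the logarithmic rule: $y_{i+1} = \theta y_i$, $\forall i \ge 2$, with
uniform constant $\theta \ge 1$. Then, Algorithm 1 has the following error bound:

$$0 \ge \hat{V}^*(x_0, \alpha) - \min_{\mu \in \Pi_H} \mathrm{CVaR}_\alpha\Big(\lim_{T \to \infty} C_{0,T} \,\Big|\, x_0, \mu\Big) \ge -\frac{\gamma}{1 - \gamma}\, O\big((\theta - 1) + \epsilon\big),$$

and the following finite time convergence error bound:

$$\mathbf{T}_\mathcal{I}^n[V_0](x_0, \alpha) - \min_{\mu \in \Pi_H} \mathrm{CVaR}_\alpha\Big(\lim_{T \to \infty} C_{0,T} \,\Big|\, x_0, \mu\Big) \le \frac{O\big((\theta - 1) + \epsilon\big)}{1 - \gamma} + O(\gamma^n).$$
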
Theorem 7 shows that 1) the interpolation-based value function is a conservative
estimate for the optimal solution to problem (3); 2) the interpolation procedure is
consistent, i.e., when the number of interpolation points is arbitrarily large
(specifically, $\epsilon \to 0$ and $y_{i+1}/y_i \to 1$), the approximation error tends
to zero; and 3) the approximation error bound is $O((\theta - 1) + \epsilon)$, where
$\log \theta$ is the log-difference of the interpolation points, i.e.,
$\log \theta = \log y_{i+1} - \log y_i$, $\forall i$.

For a pre-specified $\epsilon$, the condition $V_t(x, y_2) - V_t(x, 0) \ge -\epsilon$
may be satisfied by a simple adaptive procedure for selecting the interpolation points
$Y(x)$. At each iteration $t > 0$, after calculating $V_t(x, y_i)$ in Algorithm 1, at
each state $x$ in which the condition does not hold, add a new interpolation point
$y_2' = \epsilon\, y_2 / |V_t(x, y_2) - V_t(x, 0)|$, and additional points between
$y_2'$ and $y_2$ such that the condition $\log \theta \ge \log y_{i+1} - \log y_i$ is
maintained. Since all the additional points belong to the segment $[y_1, y_2]$, the
linearly interpolated $V_t(x, y_i)$ remains unchanged, and Algorithm 1 proceeds as is.
For bounded costs and $\epsilon > 0$, the number of additional points required is
bounded.
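The logarithmic rule $y_{i+1} = \theta y_i$ can be sketched as follows. The function name `log_spaced_grid` and the numerical values of $y_2$ and $\theta$ are hypothetical choices for illustration; the sketch does not implement the adaptive selection of $y_2$ described above, and it simply clips the last point to 1.

```python
def log_spaced_grid(y2, theta):
    """Interpolation points 0 = y_1 < y_2 < ... <= 1 with y_{i+1} = theta * y_i for i >= 2."""
    if not (0.0 < y2 < 1.0 and theta > 1.0):
        raise ValueError("need 0 < y2 < 1 and theta > 1")
    grid = [0.0]
    y = y2
    while y < 1.0:
        grid.append(y)
        y *= theta
    grid.append(1.0)   # last point clipped to 1, so the final ratio is at most theta
    return grid


grid = log_spaced_grid(y2=0.01, theta=2.0)
print(len(grid), grid)
```

With $y_2 = 0.01$ and $\theta = 2$ this produces nine points. Decreasing $\theta$ toward 1 or decreasing $y_2$ increases the number of points, and the error bound of Theorem 7 tightens accordingly.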

The full proof of Theorem 7 is detailed in the supplementary material; we highlight the
main ideas and challenges involved. In the first part of the proof we bound, for all
$t > 0$, the Lipschitz constant of $yV_t(x, y)$ in $y$. The key to this result is to
show that the Bellman operator $\mathbf{T}$ preserves the Lipschitz property for
$yV_t(x, y)$. Using the Lipschitz bound and the concavity of $yV_t(x, y)$, we then
bound the error $\mathcal{I}_x[V_t](y)/y - V_t(x, y)$ for all $y$. The condition on
$y_2$ is required for this bound to hold when $y \to 0$. Finally, we use this result to
bound $\|\mathbf{T}_\mathcal{I}[V_t](x, y) - \mathbf{T}[V_t](x, y)\|_\infty$. The
results of Theorem 7 follow from contraction arguments, similar to approximate dynamic
programming [3].

5 Experiments 

We validate Algorithm 1 on a rectangular grid world, where states represent grid points
on a 2D terrain map. An agent (e.g., a robotic vehicle) starts in a safe region and its
objective is to travel to a given destination. At each time step the agent can move to
any of its four neighboring states. Due to sensing and control noise, however, with
probability $\delta$ a move to a random neighboring state occurs. The stage-wise cost
of each move until reaching the destination is 1, to account for fuel usage. In between
the starting point and the destination there are a number of obstacles that the agent
should avoid. Hitting an obstacle costs $M \gg 1$ and terminates the mission. The
objective is to compute a safe (i.e., obstacle-free) path that is fuel efficient.

For our experiments, we choose a 64 x 53 grid-world (see Figure 1), for a total of
3,312 states. The destination is at position (60, 2), and there are 80 obstacles
plotted in yellow. By leveraging Theorem 7, we use 21 log-spaced interpolation points
for Algorithm 1 in order to achieve a small value function error. We choose
$\delta = 0.05$, and a discount factor $\gamma = 0.95$ for an effective horizon of 200
steps. Furthermore, we set the penalty cost equal to $M = 2/(1 - \gamma)$; such a
choice trades off a high penalty for collisions against computational complexity (which
increases as $M$ increases).

In Figure 1 we plot the value function $V(x, y)$ for three different values of the CVaR
confidence parameter $\alpha$, and the corresponding paths starting from the initial
position (60, 50). The first three plots in Figure 1 show how, by decreasing the
confidence parameter $\alpha$, the average travel distance (and hence fuel consumption)
slightly increases but the collision probability decreases, as expected. We next
discuss robustness to modeling errors. We conducted simulations in which with
probability 0.5 each obstacle position is perturbed in a random direction to one of the
neighboring grid cells. This emulates, for example, measurement errors in the terrain
map. We then trained both the risk-averse ($\alpha = 0.11$) and risk-neutral
($\alpha = 1$) policies on the nominal (i.e., unperturbed) terrain map, and evaluated
them on 400 perturbed scenarios (20 perturbed maps with 20 Monte Carlo evaluations
each). While the risk-neutral policy finds a shorter route (with average cost equal to
18.137 on successful runs), it is vulnerable to perturbations and fails more often
(with over 120 failed runs). In contrast, the risk-averse policy chooses slightly
longer routes (with average cost equal to 18.878 on successful runs), but is much more
robust to model perturbations (with only 5 failed runs).

Figure 1: Grid-world simulation. Left three plots show the value functions and
corresponding paths for different CVaR confidence levels. The rightmost plot shows a
cost histogram (for 400 Monte Carlo trials) for a risk-neutral policy and a CVaR policy
with confidence level $\alpha = 0.11$.

For the computation of Algorithm 1 we represented the concave piecewise linear
maximization problem in (10) as a linear program, and concatenated several problems to
reduce repeated overhead stemming from the initialization of the CPLEX linear
programming solver. This resulted in a computation time on the order of two hours. We
believe there is ample room for improvement, for example by leveraging parallelization
and sampling-based methods. Overall, we believe our proposed approach is currently the
most practical method available for solving CVaR MDPs (as a comparison, the recently
proposed method in [?] involves infinite dimensional optimization). The Matlab code
used for the experiments is provided in the supplementary material.

6 Conclusion 

In this paper we presented an algorithm for CVaR MDPs, based on approximate value- 
iteration on an augmented state space. We established convergence of our algorithm, 
and derived finite-time error bounds. These bounds are useful to stop the algorithm at 
a desired error threshold. 

In addition, we uncovered an interesting relationship between the CVaR of the total
cost and the worst-case expected cost under adversarial model perturbations. In this
formulation, the perturbations are correlated in time, and lead to a robustness
framework significantly less conservative than the popular robust-MDP framework, where
the uncertainty is temporally independent.

Collectively, our work suggests CVaR MDPs as a unifying and practical frame¬ 
work for computing control policies that are robust with respect to both stochasticity 
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and model perturbations. Future work should address extensions to large state-spaces. 
We conjecture that a sampling-based approximate DP approach 13 should be feasible 
since, as proven in this paper, the CVaR Bellman equation is contracting (as required 
by approximate DP methods). 

A Proofs of Theoretical Results

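A.1 Proof of Proposition 1

By definition, we have that

$$
\begin{aligned}
\mathbb{E}_{\hat P}[C(x_1,\dots,x_T)]
&= \sum_{(x_1,\dots,x_T)} P_1(x_1)\delta_1(x_1)\cdots P_T(x_T|x_{T-1})\delta_T(x_T|x_{T-1})\,C(x_1,\dots,x_T) \\
&= \sum_{(x_1,\dots,x_T)} P(x_1,\dots,x_T)\,\delta_1(x_1)\delta_2(x_2|x_1)\cdots\delta_T(x_T|x_{T-1})\,C(x_1,\dots,x_T) \\
&= \sum_{(x_1,\dots,x_T)} P(x_1,\dots,x_T)\,\delta(x_1,\dots,x_T)\,C(x_1,\dots,x_T),
\end{aligned}
$$

where $\delta(x_1,\dots,x_T)$ denotes the product $\delta_1(x_1)\delta_2(x_2|x_1)\cdots\delta_T(x_T|x_{T-1})$.
Note that by definition of the set $\Delta_\eta$, for any $(\delta_1,\dots,\delta_T) \in \Delta_\eta$ we have that
$P(x_1,\dots,x_T) > 0 \rightarrow \delta(x_1,\dots,x_T) > 0$, and

$$
\mathbb{E}[\delta(x_1,\dots,x_T)] = \sum_{(x_1,\dots,x_T)} P(x_1,\dots,x_T)\,\delta(x_1,\dots,x_T) = 1.
$$

Thus,

$$
\begin{aligned}
\sup_{(\delta_1,\dots,\delta_T)\in\Delta_\eta} \mathbb{E}_{\hat P}[C(x_1,\dots,x_T)]
&= \sup_{\substack{0\le\delta(x_1,\dots,x_T)\le\eta \\ \mathbb{E}[\delta(x_1,\dots,x_T)]=1}} \sum_{(x_1,\dots,x_T)} P(x_1,\dots,x_T)\,\delta(x_1,\dots,x_T)\,C(x_1,\dots,x_T) \\
&= \mathrm{CVaR}_{1/\eta}\big(C(x_1,\dots,x_T)\big),
\end{aligned}
$$

where the last equality is by the representation theorem for CVaR [22].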
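
The representation used in the last step can be checked numerically on a small discrete cost distribution. The sketch below is illustrative only: the function names, the three-outcome distribution, and the budget value `eta` are assumptions introduced here, not taken from the experiments. It compares the worst-case reweighted expectation (weights bounded by $1/\alpha$, with $\alpha = 1/\eta$) against the usual minimization formula for CVaR.

```python
def cvar_dual(costs, probs, alpha):
    """CVaR_alpha as a worst-case expectation over reweightings xi with
    0 <= xi_i <= 1/alpha and sum_i xi_i * p_i = 1 (discrete distribution)."""
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    budget = alpha          # probability mass of the tail still to be filled
    total = 0.0
    for i in order:         # fill the tail with the costliest outcomes first
        take = min(probs[i], budget)
        total += take * costs[i]
        budget -= take
        if budget <= 0.0:
            break
    return total / alpha


def cvar_primal(costs, probs, alpha):
    """CVaR_alpha via min_w { w + E[(C - w)^+] / alpha }.
    The objective is convex piecewise linear in w, so an atom attains the minimum."""
    def objective(w):
        return w + sum(p * max(c - w, 0.0) for c, p in zip(costs, probs)) / alpha
    return min(objective(w) for w in costs)


costs = [1.0, 2.0, 10.0]    # hypothetical total costs
probs = [0.5, 0.4, 0.1]     # nominal probabilities
eta = 5.0                   # illustrative error budget, so alpha = 1 / eta
print(cvar_dual(costs, probs, 1.0 / eta), cvar_primal(costs, probs, 1.0 / eta))
```

Both functions should agree up to floating-point error; with $\alpha = 1$ the dual form reduces to the plain expectation.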


A.2 Proof of Lemma 3

The monotonicity and constant shift properties follow directly from the definition
of the Bellman operator, by noting that $\xi(x')P(x'|x,a)$ is non-negative and
$\sum_{x'\in\mathcal{X}} \xi(x')P(x'|x,a) = 1$ for any $\xi \in \mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot|x,a))$. For the contraction
property, denote $c = \|V_1 - V_2\|_\infty$. Since

$$
V_2(x,y) - \|V_1 - V_2\|_\infty \le V_1(x,y) \le V_2(x,y) + \|V_1 - V_2\|_\infty, \quad \forall x \in \mathcal{X},\ y \in \mathcal{Y},
$$

by monotonicity and constant shift property,

$$
\mathbf{T}[V_2](x,y) - \gamma\|V_1 - V_2\|_\infty \le \mathbf{T}[V_1](x,y) \le \mathbf{T}[V_2](x,y) + \gamma\|V_1 - V_2\|_\infty, \quad \forall x \in \mathcal{X},\ y \in \mathcal{Y}.
$$

This further implies that

$$
|\mathbf{T}[V_1](x,y) - \mathbf{T}[V_2](x,y)| \le \gamma\|V_1 - V_2\|_\infty, \quad \forall x \in \mathcal{X},\ y \in \mathcal{Y},
$$

and the contraction property follows.
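
As a short worked consequence (standard for contraction mappings, and not a step of the original proof), applying the contraction bound repeatedly shows why value iteration converges geometrically:

```latex
\|\mathbf{T}^n[V_1]-\mathbf{T}^n[V_2]\|_\infty
\le \gamma\,\|\mathbf{T}^{n-1}[V_1]-\mathbf{T}^{n-1}[V_2]\|_\infty
\le \cdots
\le \gamma^n\,\|V_1-V_2\|_\infty .
```

For instance, with an illustrative discount factor $\gamma = 0.95$, about 90 iterations shrink the initial gap by a factor of 100, since $0.95^{90} \approx 0.0099$.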
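
Now, we prove the concavity preserving property. Assume that $yV(x,y)$ is concave
in $y$ for any $x \in \mathcal{X}$. Let $y_1, y_2 \in \mathcal{Y}$ and $\lambda \in [0,1]$, and define
$y_\lambda = (1-\lambda)y_1 + \lambda y_2$. We have

$$
\begin{aligned}
&(1-\lambda)y_1\mathbf{T}[V](x,y_1) + \lambda y_2\mathbf{T}[V](x,y_2) \\
&= (1-\lambda)y_1 \min_{a_1\in\mathcal{A}} \Big[ C(x,a_1) + \gamma \max_{\xi_1\in\mathcal{U}_{\mathrm{CVaR}}(y_1,P(\cdot|x,a_1))} \sum_{x'\in\mathcal{X}} \xi_1(x')V(x',y_1\xi_1(x'))P(x'|x,a_1) \Big] \\
&\quad + \lambda y_2 \min_{a_2\in\mathcal{A}} \Big[ C(x,a_2) + \gamma \max_{\xi_2\in\mathcal{U}_{\mathrm{CVaR}}(y_2,P(\cdot|x,a_2))} \sum_{x'\in\mathcal{X}} \xi_2(x')V(x',y_2\xi_2(x'))P(x'|x,a_2) \Big] \\
&= \min_{a_1\in\mathcal{A}} \Big[ (1-\lambda)y_1C(x,a_1) + \gamma \max_{\xi_1\in\mathcal{U}_{\mathrm{CVaR}}(y_1,P(\cdot|x,a_1))} \sum_{x'\in\mathcal{X}} \xi_1(x')V(x',y_1\xi_1(x'))P(x'|x,a_1)(1-\lambda)y_1 \Big] \\
&\quad + \min_{a_2\in\mathcal{A}} \Big[ \lambda y_2C(x,a_2) + \gamma \max_{\xi_2\in\mathcal{U}_{\mathrm{CVaR}}(y_2,P(\cdot|x,a_2))} \sum_{x'\in\mathcal{X}} \xi_2(x')V(x',y_2\xi_2(x'))P(x'|x,a_2)\lambda y_2 \Big] \\
&\le \min_{a\in\mathcal{A}} \Big[ y_\lambda C(x,a) + \gamma \max_{\substack{\xi_1\in\mathcal{U}_{\mathrm{CVaR}}(y_1,P(\cdot|x,a)) \\ \xi_2\in\mathcal{U}_{\mathrm{CVaR}}(y_2,P(\cdot|x,a))}} \sum_{x'\in\mathcal{X}} P(x'|x,a)\big( (1-\lambda)y_1\xi_1(x')V(x',y_1\xi_1(x')) + \lambda y_2\xi_2(x')V(x',y_2\xi_2(x')) \big) \Big] \\
&\le \min_{a\in\mathcal{A}} \Big[ y_\lambda C(x,a) + \gamma \max_{\substack{\xi_1\in\mathcal{U}_{\mathrm{CVaR}}(y_1,P(\cdot|x,a)) \\ \xi_2\in\mathcal{U}_{\mathrm{CVaR}}(y_2,P(\cdot|x,a))}} \sum_{x'\in\mathcal{X}} P(x'|x,a)\big( (1-\lambda)y_1\xi_1(x') + \lambda y_2\xi_2(x') \big) V\big(x', (1-\lambda)y_1\xi_1(x') + \lambda y_2\xi_2(x')\big) \Big],
\end{aligned}
$$

where the first inequality is by concavity of the min, and the second is by the concavity
assumption. Now, define $\xi = \big((1-\lambda)y_1\xi_1 + \lambda y_2\xi_2\big)/y_\lambda$. When
$\xi_1 \in \mathcal{U}_{\mathrm{CVaR}}(y_1,P(\cdot|x,a))$ and $\xi_2 \in \mathcal{U}_{\mathrm{CVaR}}(y_2,P(\cdot|x,a))$, we have that
$\xi \in [0, 1/y_\lambda]$ and $\sum_{x'\in\mathcal{X}} \xi(x')P(x'|x,a) = 1$. We thus have

$$
\begin{aligned}
(1-\lambda)y_1\mathbf{T}[V](x,y_1) + \lambda y_2\mathbf{T}[V](x,y_2)
&\le \min_{a\in\mathcal{A}} \Big[ y_\lambda C(x,a) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y_\lambda,P(\cdot|x,a))} \sum_{x'\in\mathcal{X}} P(x'|x,a)\,y_\lambda\xi(x')V(x',y_\lambda\xi(x')) \Big] \\
&= y_\lambda \min_{a\in\mathcal{A}} \Big[ C(x,a) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y_\lambda,P(\cdot|x,a))} \sum_{x'\in\mathcal{X}} P(x'|x,a)\,\xi(x')V(x',y_\lambda\xi(x')) \Big] \\
&= y_\lambda\mathbf{T}[V](x,y_\lambda).
\end{aligned}
$$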


Finally, to show that the inner maximization problem in the Bellman operator is a
concave maximization, we need to show that

$$
K_{x,y,a}(z) :=
\begin{cases}
zV(x',z)P(x'|x,a)/y & \text{if } y \ne 0 \\
0 & \text{otherwise}
\end{cases}
$$

is a concave function in $z \in \mathbb{R}$ for any given $x \in \mathcal{X}$, $y \in \mathcal{Y}$ and $a \in \mathcal{A}$. Suppose
$zV(x,z)$ is a concave function in $z$. Immediately we can see that $K_{x,y,a}(z)$ is concave
in $z$ when $y = 0$. Also notice that when $y \in \mathcal{Y}\setminus\{0\}$, since the transition probability
$P(x'|x,a)$ is non-negative, we have the result that $K_{x,y,a}(z)$ is concave in $z$. This
further implies that

$$
\sum_{x'\in\mathcal{X}} K_{x,y,a}(y\xi(x')) = \frac{1}{y}\sum_{x'\in\mathcal{X}} y\xi(x')V(x',y\xi(x'))P(x'|x,a)
$$

is concave in $\xi$. Furthermore, by combining this result with the fact that the feasible set
of $\xi$ is a polytope, we complete the proof of this claim.


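A.3 Proof of Theorem 4

The first part of the proof is to show that for any $(x,y) \in \mathcal{X}\times\mathcal{Y}$,

$$
V_n(x,y) := \mathbf{T}^n[V_0](x,y) = \min_{\mu\in\Pi_M} \mathrm{CVaR}_y\big(C_{0,n} + \gamma^n V_0 \mid x_0 = x, \mu\big), \tag{11}
$$

by induction, where the initial condition is $(x_0,y_0) = (x,y)$ and the control action $a_t$ is
induced by $\mu_t(x_t,y_t)$. For $n = 1$, we have that
$V_1(x,y) = \mathbf{T}[V_0](x,y) = \min_{\mu\in\Pi_M} C(x_0,a_0) + \gamma\,\mathrm{CVaR}_y\big(C(x_1,a_1) + V_0(x_1) \mid x_0 = x, \mu\big)$
from the definition. By the induction hypothesis, assume the above expression holds at $n = k$.
For $n = k+1$,

$$
\begin{aligned}
V_{k+1}(x,y) &:= \mathbf{T}^{k+1}[V_0](x,y) = \mathbf{T}[V_k](x,y) \\
&= \min_{a\in\mathcal{A}} \Big[ C(x,a) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot|x,a))} \sum_{x'\in\mathcal{X}} \xi(x')V_k\big(x',\underbrace{y\xi(x')}_{y'}\big)P(x'|x,a) \Big] \\
&= \min_{a\in\mathcal{A}} \Big[ C(x,a) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot|x,a))} \sum_{x'\in\mathcal{X}} \xi(x')P(x'|x,a) \min_{\mu\in\Pi_M} \mathrm{CVaR}_{y'}\big(C_{0,k} + \gamma^k V_0 \mid x_0 = x', \mu\big) \Big] \\
&= \min_{a\in\mathcal{A}} \Big[ C(x,a) + \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot|x,a))} \mathbb{E}_\xi\Big[ \min_{\mu\in\Pi_M} \mathrm{CVaR}_{y_1}\big(C_{1,k+1} + \gamma^{k+1}V_0 \mid x_1, \mu\big) \Big] \Big] \\
&= \min_{\mu\in\Pi_M} \mathrm{CVaR}_y\big(C_{0,k+1} + \gamma^{k+1}V_0 \mid x_0 = x, \mu\big),
\end{aligned} \tag{12}
$$

where the initial state condition is given by $(x_0,y_0) = (x,y)$. Thus, the equality in
(11) is proved by induction.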

The second part of the proof is to show that
$V^*(x_0,y_0) = \min_{\mu\in\Pi_M} \mathrm{CVaR}_{y_0}\big(\lim_{n\to\infty} C_{0,n} \mid x_0, \mu\big)$.
Recall $\mathbf{T}[V](x,y) = \min_{a\in\mathcal{A}} \big[ C(x,a) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot|x,a))} \mathbb{E}_\xi[V \mid x,y,a] \big]$. Since
$\mathbf{T}$ is a contraction and $V_0$ is bounded, one obtains

$$
V^*(x,y) = \mathbf{T}[V^*](x,y) = \lim_{n\to\infty} \mathbf{T}^n[V_0](x,y) = \lim_{n\to\infty} V_n(x,y)
$$

for any $(x,y) \in \mathcal{X}\times\mathcal{Y}$. The first and the second equality follow directly from
Proposition 2.1 and Proposition 2.2 in 0 and the third equality follows from the definition
of $V_n$. Furthermore, since $V_0(x,y)$ is bounded for any $(x,y) \in \mathcal{X}\times\mathcal{Y}$, the result in
(12) implies

$$
-\lim_{n\to\infty} \gamma^n\|V_0\|_\infty \le V^*(x_0,y_0) - \min_{\mu\in\Pi_M} \mathrm{CVaR}_{y_0}\Big(\lim_{n\to\infty} C_{0,n} \mid x_0, \mu\Big) \le \lim_{n\to\infty} \gamma^n\|V_0\|_\infty.
$$

Therefore, by taking $n \to \infty$, we have just shown that for any $(x_0,y_0) \in \mathcal{X}\times\mathcal{Y}$,
$V^*(x_0,y_0) = \min_{\mu\in\Pi_M} \mathrm{CVaR}_{y_0}\big(\lim_{n\to\infty} C_{0,n} \mid x_0, \mu\big)$.

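The third part of the proof is to show that for the initial state $x_0$ and confidence
interval $y_0$, we have that

$$
V^*(x_0,y_0) = \min_{\mu\in\Pi_H} \mathrm{CVaR}_{y_0}\Big(\lim_{n\to\infty} C_{0,n} \mid x_0, \mu\Big).
$$

At any $(x_t,y_t) \in \mathcal{X}\times\mathcal{Y}$, we first define the $t$-th tail-subproblem of the CVaR MDP
problem as follows:

$$
\mathcal{V}(x_t,y_t) = \min_{\mu\in\Pi_H} \mathrm{CVaR}_{y_t}\Big(\lim_{n\to\infty} C_{t,n} \mid x_t, \mu\Big),
$$

where the tail policy sequence is equal to $\mu = \{\mu_t, \mu_{t+1}, \dots\}$ and the action is given
by $a_j = \mu_j(h_j)$ for $j \ge t$. For any history dependent policy $\tilde\mu \in \Pi_H$, we also define the
$\tilde\mu$-induced value function as
$\mathcal{V}_{\tilde\mu}(x_t,y_t) = \mathrm{CVaR}_{y_t}\big(\lim_{n\to\infty} C_{t,n} \mid x_t, \tilde\mu\big)$, where
$\tilde\mu = \{\tilde\mu_t, \tilde\mu_{t+1}, \dots\}$ and $a_j = \tilde\mu_j(h_j)$ for $j \ge t$.

Now let $\mu^*$ be the optimal policy of the above $t$-th tail-subproblem. Clearly, the
truncated policy $\tilde\mu = \{\mu^*_{t+1}, \mu^*_{t+2}, \dots\}$ is a feasible policy for the $(t+1)$-th tail
subproblem at any state $x_{t+1}$ and confidence interval $y_{t+1}$:

$$
\min_{\mu\in\Pi_H} \mathrm{CVaR}_{y_{t+1}}\Big(\lim_{n\to\infty} C_{t+1,n} \mid x_{t+1}, \mu\Big).
$$

Collecting the above results, for any pair $(x_t,y_t) \in \mathcal{X}\times\mathcal{Y}$ and with $a_t = \mu^*_t(x_t)$ we
can write

$$
\begin{aligned}
\mathcal{V}(x_t,y_t) &= C(x_t,a_t) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y_t,P(\cdot|x_t,a_t))} \mathbb{E}\Big[ \xi(x_{t+1}) \cdot \underbrace{\mathrm{CVaR}_{y_{t+1}}\Big(\lim_{n\to\infty} C_{t+1,n} \mid x_{t+1}, \tilde\mu\Big)}_{\mathcal{V}_{\tilde\mu}(x_{t+1},y_{t+1}),\ y_{t+1} = y_t\xi(x_{t+1})} \,\Big|\, x_t, y_t, a_t \Big] \\
&\ge C(x_t,a_t) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y_t,P(\cdot|x_t,a_t))} \mathbb{E}_\xi\big[ \mathcal{V}(x_{t+1}, y_t\xi(x_{t+1})) \mid x_t, y_t, a_t \big] \ge \mathbf{T}[\mathcal{V}](x_t,y_t).
\end{aligned}
$$

The first equality follows from the definition of $\mathcal{V}(x_t,y_t)$ and the decomposition of
CVaRs (Theorem 2). The first inequality uses the inequality $\mathcal{V}_{\tilde\mu}(x,y) \ge \mathcal{V}(x,y)$,
$\forall (x,y) \in \mathcal{X}\times\mathcal{Y}$. The second inequality follows from the definition of the Bellman
operator $\mathbf{T}$.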

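On the other hand, starting at any state $x_{t+1}$ and confidence interval $y_{t+1}$, let
$\mu^* = \{\mu^*_{t+1}, \mu^*_{t+2}, \dots\} \in \Pi_H$ be an optimal policy for the tail subproblem:

$$
\min_{\mu\in\Pi_H} \mathrm{CVaR}_{y_{t+1}}\Big(\lim_{n\to\infty} C_{t+1,n} \mid x_{t+1}, \mu\Big).
$$

For a given pair $(x_t,y_t) \in \mathcal{X}\times\mathcal{Y}$, construct the "extended" policy
$\tilde\mu = \{\tilde\mu_t, \tilde\mu_{t+1}, \dots\} \in \Pi_H$ as follows:

$$
\tilde\mu_t(x_t) = u^*(x_t,y_t), \quad\text{and}\quad \tilde\mu_j(h_j) = \mu^*_j(h_j) \ \text{ for } j \ge t+1,
$$

where $u^*(x_t,y_t)$ is the minimizer of the fixed point equation

$$
u^*(x_t,y_t) \in \arg\min_{a\in\mathcal{A}}\ C(x_t,a) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y_t,P(\cdot|x_t,a))} \mathbb{E}_\xi\big[ \mathcal{V}(x_{t+1}, y_t\xi(x_{t+1})) \mid x_t, y_t, a \big],
$$

where $y_t$ is the given confidence interval of the $t$-th tail-subproblem and the transition
from $y_t$ to $y_{t+1}$ is given by $y_{t+1} = y_t\xi^*(x_{t+1})$, where

$$
\xi^* \in \arg\max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y_t,P(\cdot|x_t,a^*))} \mathbb{E}\Big[ \xi(x_{t+1}) \cdot \mathrm{CVaR}_{y_t\xi(x_{t+1})}\Big(\lim_{n\to\infty} C_{t+1,n} \mid x_{t+1}, \tilde\mu\Big) \Big].
$$

Since $\mu^*$ is an optimal, and a fortiori feasible, policy for the tail subproblem (from time
$t+1$), the policy $\tilde\mu \in \Pi_H$ is a feasible policy for the tail subproblem (from time $t$):
$\min_{\mu\in\Pi_H} \mathrm{CVaR}_{y_t}\big(\lim_{n\to\infty} C_{t,n} \mid x_t, \mu\big)$. Hence, we can write

$$
\mathcal{V}(x_t,y_t) \le C(x_t,\tilde\mu_t(x_t)) + \gamma\,\mathrm{CVaR}_{y_t}\Big(\lim_{n\to\infty} C_{t+1,n} \mid x_t, \tilde\mu\Big).
$$

Hence, from the definition of $\mu^*$, one easily obtains:

$$
\begin{aligned}
\mathcal{V}(x_t,y_t)
&\le C(x_t,u^*(x_t,y_t)) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y_t,P(\cdot|x_t,u^*(x_t,y_t)))} \mathbb{E}\Big[ \xi(x_{t+1}) \cdot \mathrm{CVaR}_{y_t\xi(x_{t+1})}\Big(\lim_{n\to\infty} C_{t+1,n} \mid x_{t+1}, \mu^*\Big) \,\Big|\, x_t, y_t, u^*(x_t,y_t) \Big] \\
&= C(x_t,u^*(x_t,y_t)) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y_t,P(\cdot|x_t,u^*(x_t,y_t)))} \mathbb{E}_\xi\big[ \mathcal{V}(x_{t+1}, y_t\xi(x_{t+1})) \mid x_t, y_t, u^*(x_t,y_t) \big] \\
&= \mathbf{T}[\mathcal{V}](x_t,y_t).
\end{aligned}
$$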

Collecting the above results, we have shown that $\mathcal{V}$ is a fixed point solution to
$V(x,y) = \mathbf{T}[V](x,y)$ for any $(x,y) \in \mathcal{X}\times\mathcal{Y}$. Since the fixed point solution is unique,
combining both of these arguments implies $V^*(x,y) = \mathcal{V}(x,y)$ for any $(x,y) \in \mathcal{X}\times\mathcal{Y}$.
Therefore, it follows that with initial state $(x,y)$, we have
$V^*(x,y) = \mathcal{V}(x,y) = \min_{\mu\in\Pi_H} \mathrm{CVaR}_y\big(\lim_{T\to\infty} C_{0,T} \mid x_0 = x, \mu\big)$.

Combining the above three parts of the proof, the claims of this theorem follow.


A.4 Proof of Theorem 5

Similar to the definition of the optimal Bellman operator $\mathbf{T}$, for any augmented
stationary Markovian policy $u : \mathcal{X}\times\mathcal{Y} \to \mathcal{A}$, we define the policy induced Bellman operator
$\mathbf{T}_u$ as

$$
\mathbf{T}_u[V](x,y) = C(x,u(x,y)) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot|x,u(x,y)))} \sum_{x'\in\mathcal{X}} \xi(x')V(x',y\xi(x'))P(x'|x,u(x,y)).
$$

Analogous to Theorem 4, we can easily show that the fixed point solution to
$\mathbf{T}_u[V](x,y) = V(x,y)$ is unique, and the CVaR decomposition theorem (Theorem 2) further
implies that this solution equals

$$
V_u(x,y) = \mathrm{CVaR}_y\Big(\lim_{T\to\infty} C_{0,T} \mid x_0 = x, u_H\Big),
$$

where the history dependent policy $u_H = \{\mu_0, \mu_1, \dots\}$ is given by $\mu_k(h_k) = u(x_k,y_k)$
for any $k \ge 0$, with initial states $x_0$, $y_0 = \alpha$, and state transitions as in Theorem 5, but
with the augmented stationary Markovian policy $u^*$ replaced by $u$.

To complete the proof of this theorem, we need to show that the augmented
stationary Markovian policy $u^*$ is optimal if and only if

$$
\mathbf{T}[V^*](x,y) = \mathbf{T}_{u^*}[V^*](x,y), \quad \forall x \in \mathcal{X},\ y \in \mathcal{Y}, \tag{13}
$$

where $V^*(x,y)$ is the unique fixed point solution of $\mathbf{T}[V](x,y) = V(x,y)$. Here an
augmented stationary Markovian policy $u^*$ is optimal if and only if the induced history
dependent policy $u^*_H$ of Theorem 5 is optimal for the CVaR MDP problem.

First suppose $u^*$ is an optimal augmented stationary Markovian policy. Then using
the definition of $u^*$ and the result from Theorem 4 that

$$
V^*(x,y) = \min_{\mu\in\Pi_H} \mathrm{CVaR}_y\Big(\lim_{T\to\infty} C_{0,T} \mid x_0 = x, \mu\Big),
$$

we immediately show that $V^*(x,y) = V_{u^*}(x,y)$. By the fixed point equations
$\mathbf{T}[V^*](x,y) = V^*(x,y)$ and $\mathbf{T}_{u^*}[V_{u^*}](x,y) = V_{u^*}(x,y)$, this further implies that (13) holds.

Second suppose $u^*$ satisfies the equality in (13). Then by the fixed point equality
$\mathbf{T}[V^*](x,y) = V^*(x,y)$, we immediately obtain the equation $V^*(x,y) = \mathbf{T}_{u^*}[V^*](x,y)$
for any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Since the fixed point solution to $\mathbf{T}_{u^*}[V](x,y) = V(x,y)$
is unique, we further show that $\mathbf{T}[V^*](x,y) = V^*(x,y) = V_{u^*}(x,y)$ and
$V_{u^*}(x,y) = \min_{\mu\in\Pi_H} \mathrm{CVaR}_y\big(\lim_{T\to\infty} C_{0,T} \mid x_0 = x, \mu\big)$ from Theorem 4. By using the policy
construction formula of Theorem 5 to obtain the history dependent policy $u^*_H$ and following
the above arguments, in which the augmented Markovian stationary policy $u$ is replaced
by $u^*$, this further implies

$$
\min_{\mu\in\Pi_H} \mathrm{CVaR}_y\Big(\lim_{T\to\infty} C_{0,T} \mid x_0 = x, \mu\Big) = \mathrm{CVaR}_y\Big(\lim_{T\to\infty} C_{0,T} \mid x_0 = x, u^*_H\Big),
$$

i.e., $u^*$ is an optimal augmented stationary Markovian policy.
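
To make the induced history-dependent policy concrete, the following minimal sketch simulates it: the action is chosen from the augmented state $(x_k, y_k)$, and the confidence level is updated as $y_{k+1} = y_k \xi(x_{k+1})$. All names here (`rollout`, the callables `u`, `xi`, `P`, `cost`) are hypothetical placeholders for quantities that would come from solving the CVaR Bellman equation; the sketch does not compute them.

```python
import random


def rollout(x0, alpha, u, xi, P, cost, gamma, horizon, rng):
    """Simulate the history-dependent policy induced by an augmented Markov policy u.

    u(x, y)      -> action (assumed given, e.g. from value iteration)
    xi(x, y, a)  -> dict mapping next state to the maximizing reweighting (assumed given)
    P(x, a)      -> dict mapping next state to its transition probability
    cost(x, a)   -> immediate cost
    """
    x, y = x0, alpha
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = u(x, y)
        total += discount * cost(x, a)
        transition = P(x, a)
        states = list(transition)
        x_next = rng.choices(states, weights=[transition[s] for s in states])[0]
        y = y * xi(x, y, a)[x_next]   # confidence-level update y_{k+1} = y_k * xi(x_{k+1})
        x = x_next
        discount *= gamma
    return total, y
```

With `xi` identically equal to 1 the confidence level never changes, which corresponds to the risk-neutral case.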

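A.5 Proof of Lemma 6

We first prove the monotonicity property. Based on the definition of $\mathcal{I}_x[V](y)$, if
$V_1(x,y) \ge V_2(x,y)$ $\forall x \in \mathcal{X}$ and $y \in \mathcal{Y}$, we have that

$$
\mathcal{I}_x[V_1](y) = \frac{y_{i+1}V_1(x,y_{i+1})(y - y_i) + y_iV_1(x,y_i)(y_{i+1} - y)}{y_{i+1} - y_i}, \quad\text{if } y \in I_{i+1}(x).
$$

Since $y_i, y_{i+1} \in \mathcal{Y}$ and $(y_{i+1} - y), (y - y_i) \ge 0$ (because $y \in I_{i+1}(x)$), we can
easily see that $\mathcal{I}_x[V_1](y) \ge \mathcal{I}_x[V_2](y)$. As $y \in \mathcal{Y}$ and $\xi(\cdot)P(\cdot|x,a) \ge 0$ for any
$\xi \in \mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot|x,a))$, this further implies $\mathbf{T}_{\mathcal{I}}[V_1](x,y) \ge \mathbf{T}_{\mathcal{I}}[V_2](x,y)$.

Next we prove the constant shift property. Note from the definition of $\mathcal{I}_x[V](y)$
that

$$
\begin{aligned}
\mathcal{I}_x[V+K](y) &= y_i\big(V(x,y_i)+K\big) + \frac{y_{i+1}\big(V(x,y_{i+1})+K\big) - y_i\big(V(x,y_i)+K\big)}{y_{i+1} - y_i}(y - y_i) \\
&= yK + y_iV(x,y_i) + \frac{y_{i+1}V(x,y_{i+1}) - y_iV(x,y_i)}{y_{i+1} - y_i}(y - y_i), \quad\text{if } y \in I_{i+1}(x) \\
&= \mathcal{I}_x[V](y) + yK.
\end{aligned}
$$

Therefore, by definition of $\mathbf{T}_{\mathcal{I}}[V](x,y)$, the constant shift property
$\mathbf{T}_{\mathcal{I}}[V+K](x,y) = \mathbf{T}_{\mathcal{I}}[V](x,y) + \gamma K$, for any $x \in \mathcal{X}$, $y \in \mathcal{Y}$, follows directly from the
above arguments.

Equipped with both the monotonicity and constant shift properties, the proof of
contraction of $\mathbf{T}_{\mathcal{I}}$ directly follows from the analogous proof in Lemma 3.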
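
The two properties just shown can be verified numerically for the interpolation step itself. The sketch below is a minimal illustration with made-up grid points and values (all names and numbers are assumptions); it interpolates $yV(x,y)$ linearly between grid points for a fixed state and checks that shifting $V$ by a constant $K$ shifts the interpolant by $yK$.

```python
import bisect


def interp_yV(grid, V, y):
    """Linear interpolation of y * V(y) between grid points.

    grid: increasing list of confidence levels with grid[0] == 0 and grid[-1] == 1
    V:    values V(x, y_i) at the grid points, for one fixed state x
    """
    j = min(max(bisect.bisect_right(grid, y), 1), len(grid) - 1)
    y_lo, y_hi = grid[j - 1], grid[j]
    f_lo, f_hi = y_lo * V[j - 1], y_hi * V[j]
    return f_lo + (f_hi - f_lo) / (y_hi - y_lo) * (y - y_lo)


grid = [0.0, 0.1, 0.3, 1.0]          # hypothetical interpolation points
V = [5.0, 4.0, 2.5, 1.0]             # hypothetical values at those points
K = 2.0
y = 0.2
shifted = interp_yV(grid, [v + K for v in V], y)
print(shifted, interp_yV(grid, V, y) + y * K)   # the two numbers should coincide
```

Monotonicity can be checked the same way by comparing the interpolants of two value lists that are ordered pointwise.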

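Finally we prove the concavity preserving property. Assume $yV(x,y)$ is concave
in $y \in \mathcal{Y}$ for any $x \in \mathcal{X}$. Then for $y_{i+2} > y_{i+1} > y_i$, $\forall i \in \{1,\dots,N(x)-2\}$, the
following inequality immediately follows from the definition of a concave function:

$$
\frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y\in I_{i+1}(x)}
= \frac{y_{i+1}V(x,y_{i+1}) - y_iV(x,y_i)}{y_{i+1} - y_i}
\ge \frac{y_{i+2}V(x,y_{i+2}) - y_{i+1}V(x,y_{i+1})}{y_{i+2} - y_{i+1}}
= \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y\in I_{i+2}(x)}. \tag{14}
$$

We then show the following inequality in each of the following cases, whenever
the slope exists:

$$
\mathcal{I}_x[V](z_1) \le \mathcal{I}_x[V](z_2) + \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y=z_2}(z_1 - z_2), \quad \forall z_1, z_2 \in \mathcal{Y}\setminus\{0\}.
$$

(1) There exists $i \in \{1,\dots,N(x)-1\}$ such that $z_1, z_2 \in I_{i+1}(x)$. In this case we
have that

$$
\frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y=z_1} = \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y=z_2},
$$

and this further implies

$$
\mathcal{I}_x[V](z_1) = \mathcal{I}_x[V](z_2) + \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y=z_2}(z_1 - z_2).
$$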

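(2) There exist $i, j \in \{1,\dots,N(x)-2\}$, $i+1 < j$, such that $z_1 \in I_{i+1}(x)$ and
$z_2 \in I_j(x)$. In this case, without loss of generality we assume $j = i+2$. The proof for
the case $j > i+2$ is omitted for the sake of brevity, as it can be completed by iteratively
applying the same arguments from the case $j = i+2$. Since $z_1 \in I_{i+1}(x)$, $z_2 \in I_j(x)$, we
have $z_2 - z_1 \ge 0$ and

$$
\frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y=z_1} \ge \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y=z_2}.
$$

Based on the definition of the linear interpolation function, we have that

$$
\mathcal{I}_x[V](y_{i+1}) = y_{i+1}V(x,y_{i+1}) = \mathcal{I}_x[V](y_i) + \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y\in I_{i+1}(x)}(y_{i+1} - y_i).
$$

Furthermore, combining previous arguments with the definitions of $\mathcal{I}_x[V](z_1)$, $\mathcal{I}_x[V](z_2)$
implies that for $(z_2 - y_{i+1}) \ge 0$,

$$
\begin{aligned}
\mathcal{I}_x[V](z_2) &= \mathcal{I}_x[V](y_{i+1}) + \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y\in I_{i+2}(x)}(z_2 - y_{i+1}) \\
&\le \mathcal{I}_x[V](y_{i+1}) + \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y\in I_{i+1}(x)}(z_2 - y_{i+1}) \\
&= \mathcal{I}_x[V](y_i) + \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y\in I_{i+1}(x)}(z_2 - y_i) \\
&= \mathcal{I}_x[V](z_1) + \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y=z_1}(z_2 - z_1).
\end{aligned}
$$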

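(3) There exist $i, j \in \{1,\dots,N(x)-2\}$, $i+1 < j$, such that $z_2 \in I_{i+1}(x)$ and
$z_1 \in I_j(x)$. In this case, without loss of generality we assume $j = i+2$. The proof for
the case $j > i+2$ is omitted for the sake of brevity, as it can be completed by iteratively
applying the same arguments from the case $j = i+2$. Since $z_2 \in I_{i+1}(x)$, $z_1 \in I_j(x)$,
we have $z_1 - z_2 \ge 0$ and

$$
\frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y=z_1} \le \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y=z_2}.
$$

Similar to the analysis in the previous case, we have that

$$
\mathcal{I}_x[V](y_i) = y_iV(x,y_i) = \mathcal{I}_x[V](y_{i+1}) + \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y\in I_{i+1}(x)}(y_i - y_{i+1}).
$$

Furthermore, combining previous arguments with the definitions of $\mathcal{I}_x[V](z_1)$, $\mathcal{I}_x[V](z_2)$
implies that for $(z_2 - y_{i+1}) \le 0$,

$$
\begin{aligned}
\mathcal{I}_x[V](z_2) &= \mathcal{I}_x[V](y_i) + \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y\in I_{i+1}(x)}(z_2 - y_i) \\
&= \mathcal{I}_x[V](y_{i+1}) + \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y\in I_{i+1}(x)}(z_2 - y_{i+1}) \\
&\le \mathcal{I}_x[V](y_{i+1}) + \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y\in I_{i+2}(x)}(z_2 - y_{i+1}) \\
&= \mathcal{I}_x[V](z_1) + \frac{d\mathcal{I}_x[V](y)}{dy}\bigg|_{y=z_1}(z_2 - z_1).
\end{aligned}
$$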

Thus we have just shown that the first order sufficient condition for concave
functions, corresponding to $\mathcal{I}_x[V](y)$, holds, i.e., $\mathcal{I}_x[V](y)$ is concave in $y \in \mathcal{Y}\setminus\{0\}$ for
any given $x \in \mathcal{X}$. Now, $\mathcal{I}_x[V](y)$ is a continuous piecewise linear function in
$y \in \mathcal{Y}$ and a concave function when the domain is restricted to $\mathcal{Y}\setminus\{0\}$. By continuity
this immediately implies that $\mathcal{I}_x[V](y)$ is concave in $y \in \mathcal{Y}$ as well. Then, following
the identical arguments in the proof of Lemma 3 for the concavity preserving property,
we can thereby show that

$$
y\mathbf{T}_{\mathcal{I}}[V](x,y) = \min_{a\in\mathcal{A}} \Big[ yC(x,a) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot|x,a))} \sum_{x'\in\mathcal{X}} \mathcal{I}_{x'}[V](y\xi(x'))P(x'|x,a) \Big]
$$

is concave in $y \in \mathcal{Y}$ for any given $x \in \mathcal{X}$.

A.6 Useful Intermediate Results

Lemma 8 Let $f(y) : [0,1] \to \mathbb{R}$ be a concave function, differentiable almost
everywhere, with Lipschitz constant $M$. Then the linear interpolation $\mathcal{I}[f](y)$ is also
concave, and with Lipschitz constant $M_{\mathcal{I}} \le M$.

Proof For every segment $[y_j, y_{j+1}]$ in the linear interpolation, $f(y)$ is concave, and
with Lipschitz constant $M$, and $\mathcal{I}[f](y)$ is linear. Also, $f(y_j) = \mathcal{I}[f](y_j)$, and
$f(y_{j+1}) = \mathcal{I}[f](y_{j+1})$, by definition of the linear interpolation. Denote by $c_j$ the
magnitude of the slope of $\mathcal{I}[f](y)$ at $y \in [y_j, y_{j+1}]$.

Assume by contradiction that $c_j > |f'(y)|$ whenever $f'(y)$ exists.

Consider the case when $f(y_{j+1}) \ge f(y_j)$. This implies $c_j$ is the slope of the
interpolation function $\mathcal{I}[f](y)$ at $y \in [y_j, y_{j+1}]$. Then by the fundamental theorem of
calculus, we have

$$
f(y_{j+1}) - f(y_j) = \int_{y_j}^{y_{j+1}} f'(y)\,dy \le \int_{y_j}^{y_{j+1}} |f'(y)|\,dy < \int_{y_j}^{y_{j+1}} c_j\,dy = \mathcal{I}[f](y_{j+1}) - \mathcal{I}[f](y_j),
$$

contradicting $f(y_{j+1}) = \mathcal{I}[f](y_{j+1})$ and $f(y_j) = \mathcal{I}[f](y_j)$.

On the other hand, consider the case when $f(y_{j+1}) < f(y_j)$. This implies $-c_j$ is
the slope of the interpolation function $\mathcal{I}[f](y)$ at $y \in [y_j, y_{j+1}]$. Again by the
fundamental theorem of calculus,

$$
f(y_{j+1}) - f(y_j) = \int_{y_j}^{y_{j+1}} f'(y)\,dy \ge \int_{y_j}^{y_{j+1}} -|f'(y)|\,dy > \int_{y_j}^{y_{j+1}} -c_j\,dy = \mathcal{I}[f](y_{j+1}) - \mathcal{I}[f](y_j).
$$

Since $f(y_{j+1}) = \mathcal{I}[f](y_{j+1})$ and $f(y_j) = \mathcal{I}[f](y_j)$, which implies
$\mathcal{I}[f](y_j) - \mathcal{I}[f](y_{j+1}) > 0$, the above expression clearly leads to a contradiction.

We finally have that $|f'(y)| \ge c_j$ for some $y$ in segment $j \in \{1,\dots,N(x)-1\}$. As this
argument holds for each segment, by maximizing over $j \in \{1,\dots,N(x)-1\}$, we have that

$$
M \ge \max_j \max_{y\in[y_j,y_{j+1}]} |f'(y)| \ge \max_j c_j = M_{\mathcal{I}}.
$$

The concavity property (and thus differentiability almost everywhere) is a well-known
result of linear interpolation [15].
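
Lemma 8 can be illustrated on a simple concave function. The sketch below is only a sanity check under assumptions chosen here: the test function $f(y) = y(1-y)$ (for which $M = 1$ on $[0,1]$) and the grid are illustrative, and `chord_slopes` is a helper name introduced for this example.

```python
def chord_slopes(f, grid):
    """Slopes of the piecewise linear interpolation of f on consecutive grid points."""
    return [(f(b) - f(a)) / (b - a) for a, b in zip(grid, grid[1:])]


def f(y):
    # Concave on [0, 1]; |f'(y)| = |1 - 2y| <= 1, so the Lipschitz constant is M = 1.
    return y * (1.0 - y)


grid = [0.0, 0.05, 0.2, 0.5, 1.0]            # hypothetical interpolation points
slopes = chord_slopes(f, grid)
M_I = max(abs(s) for s in slopes)            # Lipschitz constant of the interpolant
is_concave = all(s1 >= s2 for s1, s2 in zip(slopes, slopes[1:]))
print(slopes, M_I, is_concave)
```

The segment slopes are non-increasing (concavity) and their largest magnitude does not exceed $M$, as the lemma states.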

Lemma 9 Let $yV(x,y)$ be Lipschitz with constant $M$, concave, and differentiable
almost everywhere, for every $x \in \mathcal{X}$ and $y \in [0,1]$. Then $y\mathbf{T}[V](x,y)$ is also Lipschitz
with constant $C_{\max} + \gamma M$.

Proof For any given state-action pair $x \in \mathcal{X}$ and $a \in \mathcal{A}$, let $P(x') = P(x'|x,a)$ be
the transition kernel. Consider the function

$$
H(y) = \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot))} \sum_{x'\in\mathcal{X}} y\xi(x')V(x',y\xi(x'))P(x').
$$

Note that, by definition of $\mathcal{U}_{\mathrm{CVaR}}$, and a change of variables $z(x') = y\xi(x')$, we can
write $H(y)$ as follows:

$$
H(y) = \max_{\substack{0\le z(x')\le 1 \\ \sum_{x'\in\mathcal{X}} P(x')z(x') = y}} \sum_{x'\in\mathcal{X}} z(x')V(x',z(x'))P(x'). \tag{15}
$$

The Lagrangian of the above maximization problem is

$$
L(z,\lambda;y) = \sum_{x'\in\mathcal{X}} z(x')V(x',z(x'))P(x') - \lambda\Big(\sum_{x'\in\mathcal{X}} P(x')z(x') - y\Big).
$$

Since $yV(x,y)$ is concave, the maximum is attained. By the first order optimality
condition the following expression holds:

$$
\frac{\partial L(z,\lambda;y)}{\partial z(x')} = P(x')\frac{\partial[z(x')V(x',z(x'))]}{\partial z(x')} - \lambda P(x') = 0.
$$

Summing the last expression over $x'$, we obtain:

$$
\sum_{x'\in\mathcal{X}} P(x')\frac{\partial[z(x')V(x',z(x'))]}{\partial z(x')} = \lambda\sum_{x'\in\mathcal{X}} P(x') = \lambda.
$$

Now, from the Lipschitz property of $yV(x,y)$, we have

$$
\bigg|\sum_{x'\in\mathcal{X}} P(x')\frac{\partial[z(x')V(x',z(x'))]}{\partial z(x')}\bigg|
\le \sum_{x'\in\mathcal{X}} P(x')\bigg|\frac{\partial[z(x')V(x',z(x'))]}{\partial z(x')}\bigg|
\le \sum_{x'\in\mathcal{X}} P(x')M = M.
$$

Thus, $|\lambda| \le M$.

Note that the objective in (15) does not depend on $y$. From the envelope theorem [14],
it follows that

$$
\frac{dH(y)}{dy} = \lambda,
$$

therefore, $H(y)$ is Lipschitz, with constant $M$.

Now, by definition,

$$
y\mathbf{T}[V](x,y) = \min_{a\in\mathcal{A}} \Big[ yC(x,a) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot|x,a))} \sum_{x'\in\mathcal{X}} y\xi(x')V(x',y\xi(x'))P(x'|x,a) \Big].
$$

Using our Lipschitz result for $H(y)$, we have that for any $a \in \mathcal{A}$, the function

$$
yC(x,a) + \gamma \max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot|x,a))} \sum_{x'\in\mathcal{X}} y\xi(x')V(x',y\xi(x'))P(x'|x,a)
$$

is Lipschitz in $y$, with constant $C(x,a) + \gamma M$. Using again the envelope theorem [14],
we obtain that $y\mathbf{T}[V](x,y)$ is Lipschitz, with constant $C_{\max} + \gamma M$.

Lemma 10 Consider Algorithm 1. Assume that for any $x \in \mathcal{X}$, the initial value
function satisfies that $yV_0(x,y)$ is Lipschitz (in $y$), with uniform constant $M_0$. We have that
for any $t \in \{0,1,\dots\}$, the function $yV_t(x,y)$ is Lipschitz in $y$ for any $x \in \mathcal{X}$, with
Lipschitz constant

$$
M_t = \frac{1-\gamma^t}{1-\gamma}C_{\max} + \gamma^t M_0 \le \frac{C_{\max}}{1-\gamma} + M_0, \quad \forall t.
$$

Proof Let $\mathbf{T}_{\mathcal{I}}[V]$ denote the application of the Bellman operator $\mathbf{T}$ to the
linearly-interpolated version of $yV(x,y)$. We have, by definition, that

$$
V_1(x,y) = \mathbf{T}_{\mathcal{I}}[V_0](x,y).
$$

Using Lemma 8 and Lemma 9, we have that $yV_1(x,y)$ is Lipschitz, with $M_1 \le C_{\max} + \gamma M_0$.

Note now that $V_2(x,y) = \mathbf{T}_{\mathcal{I}}[V_1](x,y)$. Thus, by induction, we have

$$
M_t \le \frac{1-\gamma^t}{1-\gamma}C_{\max} + \gamma^t M_0,
$$

and the result follows.
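
The induction in this proof amounts to unrolling the recursion $M_{t+1} \le C_{\max} + \gamma M_t$. The sketch below checks the unrolled form against the geometric-series expression and the uniform bound $C_{\max}/(1-\gamma) + M_0$. The numerical values and function names are illustrative assumptions.

```python
def lipschitz_bounds(c_max, gamma, m0, steps):
    """Iterate the one-step bound M_{t+1} = c_max + gamma * M_t starting from M_0."""
    bounds = [m0]
    for _ in range(steps):
        bounds.append(c_max + gamma * bounds[-1])
    return bounds


def closed_form(c_max, gamma, m0, t):
    """Geometric-series form (1 - gamma^t) / (1 - gamma) * c_max + gamma^t * M_0."""
    return (1.0 - gamma ** t) / (1.0 - gamma) * c_max + gamma ** t * m0


c_max, gamma, m0 = 2.0, 0.9, 1.0     # hypothetical constants
iterated = lipschitz_bounds(c_max, gamma, m0, 20)
uniform = c_max / (1.0 - gamma) + m0
print(iterated[-1], closed_form(c_max, gamma, m0, 20), uniform)
```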


A.7 Proof of Theorem 7

The proof of this theorem is split into three parts. In the first part, we bound the
difference $\mathcal{I}_x[V_t](y)/y - V_t(x,y)$ at each state $(x,y) \in \mathcal{X}\times\mathcal{Y}$ using the previous
technical lemmas and the Lipschitz property.

In the second part, we bound the difference $\mathbf{T}_{\mathcal{I}}[V_t](x,y) - \mathbf{T}[V_t](x,y)$.

In the third part we bound the interpolation error using contraction properties of
Bellman recursions.

First we analyze the bounds for $\mathcal{I}_x[V_t](y)/y - V_t(x,y)$ in the following four cases.
Notice that from Lemma 10 we have that $|d\mathcal{I}_x[V_t](y)/dy| \le M := C_{\max}/(1-\gamma) + M_0$.

(1) When $y = 0$ (for which $y \in I_1(x)$).

Using previous analysis and L'Hospital's rule we have that $\lim_{y\to 0} \mathcal{I}_x[V_t](y)/y = V_t(x,0)$.
This further implies $\lim_{y\to 0} \mathcal{I}_x[V_t](y)/y - V_t(x,0) = 0$.

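(2) When $y \in I_{i+1}(x)$, $2 \le i < N(x) - 1$.

Similar to the inequality in (14), by concavity of $yV_t(x,y)$ in $y \in \mathcal{Y}$, we have that

$$
\frac{d\mathcal{I}_x[V_t](y)}{dy}\bigg|_{y\in I_{i+1}(x)} = \frac{y_{i+1}V_t(x,y_{i+1}) - y_iV_t(x,y_i)}{y_{i+1} - y_i} \le \frac{yV_t(x,y) - y_iV_t(x,y_i)}{y - y_i}
$$

and

$$
\frac{d\mathcal{I}_x[V_t](y)}{dy}\bigg|_{y\in I_{i+2}(x)} = \frac{y_{i+2}V_t(x,y_{i+2}) - y_{i+1}V_t(x,y_{i+1})}{y_{i+2} - y_{i+1}} \le \frac{y_{i+1}V_t(x,y_{i+1}) - yV_t(x,y)}{y_{i+1} - y}.
$$

From the first inequality, for each $(x,y) \in \mathcal{X}\times\mathcal{Y}$ we get

$$
\frac{\mathcal{I}_x[V_t](y)}{y} - V_t(x,y) = \frac{1}{y}\Big( y_iV_t(x,y_i) + \frac{y_{i+1}V_t(x,y_{i+1}) - y_iV_t(x,y_i)}{y_{i+1} - y_i}(y - y_i) - yV_t(x,y) \Big) \le 0. \tag{16}
$$

On the other hand, rearranging the second inequality gives

$$
\begin{aligned}
\frac{1}{y}\big(\mathcal{I}_x[V_t](y) - yV_t(x,y)\big)
&\ge \frac{1}{y}\Big( y_iV_t(x,y_i) + \frac{d\mathcal{I}_x[V_t](y)}{dy}\bigg|_{y\in I_{i+1}(x)}(y - y_i) - y_{i+1}V_t(x,y_{i+1}) - \frac{d\mathcal{I}_x[V_t](y)}{dy}\bigg|_{y\in I_{i+2}(x)}(y - y_{i+1}) \Big) \\
&= \Big( \frac{d\mathcal{I}_x[V_t](y)}{dy}\bigg|_{y\in I_{i+1}(x)} - \frac{d\mathcal{I}_x[V_t](y)}{dy}\bigg|_{y\in I_{i+2}(x)} \Big)\frac{y - y_{i+1}}{y}
\ge -2M\Big(\frac{y_{i+1}}{y} - 1\Big).
\end{aligned} \tag{17}
$$

Furthermore, by the Lipschitz property, we also have the following inequality:

$$
\begin{aligned}
\frac{1}{y}\big(\mathcal{I}_x[V_t](y) - yV_t(x,y)\big)
&= \frac{y_{i+1}V_t(x,y_{i+1})(y - y_i) + y_iV_t(x,y_i)(y_{i+1} - y)}{(y_{i+1} - y_i)y} - V_t(x,y) \\
&\ge \frac{y_iV_t(x,y_i)(y - y_i) + y_iV_t(x,y_i)(y_{i+1} - y) - M(y_{i+1} - y_i)(y - y_i)}{(y_{i+1} - y_i)y} - V_t(x,y) \\
&= \frac{y_iV_t(x,y_i) - M(y - y_i)}{y} - V_t(x,y) \ge -2M\Big(1 - \frac{y_i}{y}\Big).
\end{aligned} \tag{18}
$$

Combining the inequalities (17) and (18), the following lower bound for
$\mathcal{I}_x[V_t](y)/y - V_t(x,y)$ holds:

$$
\frac{1}{y}\big(\mathcal{I}_x[V_t](y) - yV_t(x,y)\big) \ge \delta := -2M\min\Big\{1 - \frac{y_i}{y},\ \frac{y_{i+1}}{y} - 1\Big\}, \quad \forall y \in I_{i+1}(x),\ i \ge 2.
$$

From the above definition, when $y_i \le y \le (y_i + y_{i+1})/2$, the lower bound becomes
$\delta = -2M(1 - y_i/y)$, and when $(y_i + y_{i+1})/2 < y \le y_{i+1}$, the corresponding lower
bound is $\delta = -2M(y_{i+1}/y - 1)$. In both cases, $\delta$ is minimized when $y = (y_i + y_{i+1})/2$.
Therefore, the above analysis implies the following lower bound:

$$
\frac{1}{y}\big(\mathcal{I}_x[V_t](y) - yV_t(x,y)\big) \ge -2M\frac{y_{i+1} - y_i}{y_{i+1} + y_i}, \quad \forall y \in I_{i+1}(x),\ i \ge 2.
$$

When $y_{i+1} = \theta y_i$ for $i \in \{2,\dots,N(x)-1\}$ for some constant $\theta \ge 1$, this further
implies that

$$
\frac{1}{y}\big(\mathcal{I}_x[V_t](y) - yV_t(x,y)\big) \ge -2M\frac{\theta - 1}{\theta + 1} \ge -M(\theta - 1), \quad \forall y \in \mathcal{Y}\setminus[0,\epsilon].
$$

Then, combining the results, we get the following bound for $\mathcal{I}_x[V_t](y)/y - V_t(x,y)$:

$$
-M(\theta - 1) \le \frac{\mathcal{I}_x[V_t](y)}{y} - V_t(x,y) \le 0, \quad \forall y \in I_{i+1}(x),\ i \ge 2.
$$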
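
The role of the geometric spacing $y_{i+1} = \theta y_i$ can be seen numerically: on each interval the worst relative gap $\min\{1 - y_i/y,\ y_{i+1}/y - 1\}$ is attained at the midpoint and equals $(\theta-1)/(\theta+1)$. The sketch below builds such a grid and evaluates the gap. The function names and the values of `eps` and `theta` are assumptions made for illustration, and the last interval is simply clipped at 1.

```python
def log_grid(eps, theta):
    """Grid 0 < eps < eps*theta < ... < 1 with ratio theta between positive points.

    Requires 0 < eps < 1 and theta > 1; the final interval is clipped at 1,
    so its ratio can be smaller than theta.
    """
    if not (0.0 < eps < 1.0) or theta <= 1.0:
        raise ValueError("need 0 < eps < 1 and theta > 1")
    grid = [0.0, eps]
    while grid[-1] * theta < 1.0:
        grid.append(grid[-1] * theta)
    grid.append(1.0)
    return grid


def relative_gap(y_lo, y_hi, y):
    """The quantity min{1 - y_lo / y, y_hi / y - 1} that scales the lower bound."""
    return min(1.0 - y_lo / y, y_hi / y - 1.0)


theta, eps = 2.0, 0.125              # hypothetical spacing parameters
grid = log_grid(eps, theta)
mid = 0.5 * (grid[2] + grid[3])
print(grid, relative_gap(grid[2], grid[3], mid), (theta - 1.0) / (theta + 1.0))
```

A smaller $\theta$ tightens the bound at the price of more interpolation points, roughly $\log(1/\epsilon)/\log\theta$ of them.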

(3) When $y \in I_{N(x)}(x)$, i.e., $y \in (y_{N(x)-1}, 1]$.

Similar to the proof of case (2), we can show that for any $x \in \mathcal{X}$ and $y \in I_{N(x)}(x)$, the
same lines of arguments in inequalities (16) and (18) hold, which implies

$$
-2M\big(1 - y_{N(x)-1}\big) \le -2M\Big(1 - \frac{y_{N(x)-1}}{y}\Big) \le \frac{1}{y}\big(\mathcal{I}_x[V_t](y) - yV_t(x,y)\big) \le 0.
$$

When $y_{N(x)} = 1 = \theta y_{N(x)-1}$, this further shows that

$$
-2My_{N(x)-1}(\theta - 1) = -2M\big(y_{N(x)} - y_{N(x)-1}\big) \le \frac{1}{y}\big(\mathcal{I}_x[V_t](y) - yV_t(x,y)\big) \le 0,
$$

and

$$
-2M(\theta - 1) \le -2My_{N(x)-1}(\theta - 1) \le \frac{1}{y}\big(\mathcal{I}_x[V_t](y) - yV_t(x,y)\big) \le 0.
$$

(4) When $y \in I_2(x)$, i.e., $y \in (0, y_2]$.

From inequality (16) and the definition of $\mathcal{I}_x[V_t](y)$, we have that

$$
0 \ge \frac{\mathcal{I}_x[V_t](y) - yV_t(x,y)}{y} = \frac{y\big(V_t(x,y_2) - V_t(x,y)\big)}{y} = V_t(x,y_2) - V_t(x,y) \ge V_t(x,y_2) - V_t(x,0).
$$

The first inequality is due to the fact that $yV_t(x,y)$ is concave in $y \in \mathcal{Y}$ for any $x \in \mathcal{X}$,
thus the first order condition implies

$$
\frac{y_2V_t(x,y_2) - y_1V_t(x,y_1)}{y_2 - y_1} \le \frac{yV_t(x,y) - y_1V_t(x,y_1)}{y - y_1},
$$

and the last inequality is due to the similar fact that

$$
V_t(x,w) = \frac{wV_t(x,w) - 0\cdot V_t(x,0)}{w - 0} \le \frac{zV_t(x,z) - 0\cdot V_t(x,0)}{z - 0} = V_t(x,z), \quad \forall z, w \in \mathcal{Y},\ z \le w.
$$

Therefore the condition of this theorem implies

$$
0 \ge \frac{\mathcal{I}_x[V_t](y) - yV_t(x,y)}{y} \ge -\epsilon, \quad \forall t \ge 0,\ x \in \mathcal{X},\ y \in \mathcal{Y}.
$$

Combining the above four cases, we have that for each state $(x,y) \in \mathcal{X}\times\mathcal{Y}$,

$$
0 \ge \frac{\mathcal{I}_x[V_t](y)}{y} - V_t(x,y) \ge -2M(\theta - 1) - \epsilon, \quad \forall t.
$$

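Second, we bound the difference $\mathbf{T}_{\mathcal{I}}[V_t](x,y) - \mathbf{T}[V_t](x,y)$. By recalling that
$\xi(\cdot)P(\cdot|x,a)$ is a probability distribution for any $\xi \in \mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot|x,a))$, we then
combine all previous arguments and show that at any $t \in \{0,1,\dots\}$ and any $x \in \mathcal{X}$,
$a \in \mathcal{A}$, $y \in \mathcal{Y}(x)$,

$$
\max_{\xi\in\mathcal{U}_{\mathrm{CVaR}}(y,P(\cdot|x,a))} \sum_{x'\in\mathcal{X}:\ \xi(x')\ne 0} \Big( \frac{\mathcal{I}_{x'}[V_t](y\xi(x'))}{y\xi(x')} - V_t(x',y\xi(x')) \Big)\xi(x')P(x'|x,a) \ge -\big(2M(\theta - 1) + \epsilon\big).
$$

This further implies

$$
\mathbf{T}[V_t](x,y) - \gamma\big(2M(\theta - 1) + \epsilon\big) \le \mathbf{T}_{\mathcal{I}}[V_t](x,y) \le \mathbf{T}[V_t](x,y). \tag{19}
$$

Third, we prove the error bound of interpolation based value iteration using the
above properties. By putting $t = 0$ in (19), we have that

$$
-\gamma\big(2M(\theta - 1) + \epsilon\big) \le \mathbf{T}_{\mathcal{I}}[V_0](x,y) - \mathbf{T}[V_0](x,y) \le 0.
$$

Applying the Bellman operator $\mathbf{T}$ on all sides of the above inequality and noting that
$\mathbf{T}$ is a translational invariant mapping, the above expression implies

$$
\mathbf{T}^2[V_0](x,y) - \gamma^2\big(2M(\theta - 1) + \epsilon\big) \le \mathbf{T}[\mathbf{T}_{\mathcal{I}}[V_0]](x,y) = \mathbf{T}[V_1](x,y) \le \mathbf{T}^2[V_0](x,y).
$$

By adding the inequality $-\gamma\big(2M(\theta - 1) + \epsilon\big) \le \mathbf{T}_{\mathcal{I}}[V_1](x,y) - \mathbf{T}[V_1](x,y) \le 0$ to
the above expression, this further implies the following expression:

$$
\mathbf{T}^2[V_0](x,y) - \gamma(1 + \gamma)\big(2M(\theta - 1) + \epsilon\big) \le \mathbf{T}_{\mathcal{I}}[V_1](x,y) = \mathbf{T}_{\mathcal{I}}^2[V_0](x,y) \le \mathbf{T}^2[V_0](x,y).
$$

Then, by repeating this process, we can show that for any $n \in \mathbb{N}$, the following
inequality holds:

$$
\mathbf{T}^n[V_0](x,y) - \gamma\frac{1 - \gamma^n}{1 - \gamma}\big(2M(\theta - 1) + \epsilon\big) \le \mathbf{T}_{\mathcal{I}}^n[V_0](x,y) \le \mathbf{T}^n[V_0](x,y).
$$

Note that when $n \to \infty$, we have that $\gamma^n$ converges to 0, $\mathbf{T}^n[V_0](x,y)$ converges
to $\min_{\mu\in\Pi_H} \mathrm{CVaR}_y\big(\lim_{T\to\infty} C_{0,T} \mid x_0 = x, \mu\big)$ (this follows from Theorem 4), and
$\mathbf{T}_{\mathcal{I}}^n[V_0](x,y)$ converges to the unique fixed point of $\mathbf{T}_{\mathcal{I}}$ (this follows from the
contraction property in Lemma 6).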
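
The accumulation of the per-step interpolation error in this third part follows the recursion $e_{k+1} = \gamma e_k + \gamma c$ with $c = 2M(\theta-1) + \epsilon$ and $e_0 = 0$: one application of the Bellman operator shrinks the previous error by $\gamma$, and the new interpolation step adds at most $\gamma c$. The sketch below, with illustrative constants and helper names that are assumptions, checks that this recursion reproduces the factor $\gamma(1-\gamma^n)/(1-\gamma)$.

```python
def accumulated_error(gamma, c, n):
    """Unroll e_{k+1} = gamma * e_k + gamma * c for n steps, starting from e_0 = 0."""
    e = 0.0
    for _ in range(n):
        e = gamma * e + gamma * c
    return e


def geometric_bound(gamma, c, n):
    """Closed form gamma * (1 - gamma^n) / (1 - gamma) * c."""
    return gamma * (1.0 - gamma ** n) / (1.0 - gamma) * c


gamma, c = 0.95, 0.1                 # hypothetical discount factor and per-step error
for n in (1, 2, 10, 100):
    print(n, accumulated_error(gamma, c, n), geometric_bound(gamma, c, n))
```

As $n$ grows, the accumulated error approaches $\gamma c/(1-\gamma)$ and never exceeds it.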

Furthermore, from Proposition 1.6.4 in [3], the contraction property of the Bellman
operator $\mathbf{T}$ implies that for any $x \in \mathcal{X}$, $y \in \mathcal{Y}$, the following expression holds:

$$
\big|\mathbf{T}^n[V_0](x,y) - V^*(x,y)\big| \le \gamma^n\Big(\frac{C_{\max}}{1 - \gamma} + \|Z\|_\infty\Big),
$$

where $Z$ is the bounded random variable of the initial value function
$V_0(x,y) = \mathrm{CVaR}_y(Z \mid x_0 = x)$ such that $\|V_0\|_\infty \le \|Z\|_\infty$, and
$V^*(x,y) = \min_{\mu\in\Pi_H} \mathrm{CVaR}_y\big(\lim_{T\to\infty} C_{0,T} \mid x_0 = x, \mu\big)$.
This further implies for any $x \in \mathcal{X}$, $y \in \mathcal{Y}$,

$$
\big|\mathbf{T}_{\mathcal{I}}^n[V_0](x,y) - V^*(x,y)\big| \le \gamma\frac{1 - \gamma^n}{1 - \gamma}\big(2M(\theta - 1) + \epsilon\big) + \gamma^n\Big(\frac{C_{\max}}{1 - \gamma} + \|Z\|_\infty\Big).
$$

Then, by combining all the above arguments, we prove the claim of this theorem.

[Figure 2: Grid-world trajectory plots. Panels: CVaR $\alpha = 0.1$ and CVaR $\alpha = 1$.]


B Trajectory Plots 

In Figure 2 we show simulated trajectories generated by a policy that is greedy
with respect to the value function, as prescribed by Theorem 5.


C Generalization to Mean-CVaR Optimization 

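In this section we extend our approach to MDPs with a mean-CVaR objective of the
form:

$$
\min_{\mu\in\Pi_H}\ \lambda\,\mathbb{E}\Big(\lim_{T\to\infty} C_{0,T} \mid x_0, \mu\Big) + (1 - \lambda)\,\mathrm{CVaR}_\alpha\Big(\lim_{T\to\infty} C_{0,T} \mid x_0, \mu\Big), \tag{20}
$$

where $\lambda \in [0,1]$. Such an objective is common in practice [11], and is also
useful for solving CVaR-constrained objectives using standard Lagrangian methods (see,
e.g., 0).

Now for any $\alpha_1, \alpha_2 \in [0,1]$, define

$$
\rho_{\boldsymbol{\alpha}}(Z \mid H_t, \mu) = \lambda\,\mathrm{CVaR}_{\alpha_1}(Z \mid H_t, \mu) + (1 - \lambda)\,\mathrm{CVaR}_{\alpha_2}(Z \mid H_t, \mu),
$$

and notice that $\rho_{\boldsymbol{\alpha}}(Z \mid H_t, \mu) = \lambda\,\mathbb{E}(Z \mid H_t, \mu) + (1 - \lambda)\,\mathrm{CVaR}_\alpha(Z \mid H_t, \mu)$ when
the vector of CVaR confidence intervals is given by $\boldsymbol{\alpha} = (1, \alpha)$.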
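
The identity behind this definition, namely that CVaR at confidence level 1 is the plain expectation, is easy to check on a discrete distribution. The sketch below is a hedged illustration: the helper names, the three-outcome distribution, and the values of `lam` and `alpha` are assumptions introduced here.

```python
def cvar(costs, probs, alpha):
    """CVaR_alpha of a discrete cost: average of the worst alpha-fraction of outcomes."""
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    budget, total = alpha, 0.0
    for i in order:
        take = min(probs[i], budget)     # mass of outcome i that falls in the tail
        total += take * costs[i]
        budget -= take
        if budget <= 0.0:
            break
    return total / alpha


def mean_cvar(costs, probs, lam, alpha):
    """lam * E[C] + (1 - lam) * CVaR_alpha(C)."""
    mean = sum(p * c for c, p in zip(costs, probs))
    return lam * mean + (1.0 - lam) * cvar(costs, probs, alpha)


def rho(costs, probs, lam, alpha_1, alpha_2):
    """lam * CVaR_{alpha_1}(C) + (1 - lam) * CVaR_{alpha_2}(C)."""
    return lam * cvar(costs, probs, alpha_1) + (1.0 - lam) * cvar(costs, probs, alpha_2)


costs, probs = [1.0, 2.0, 10.0], [0.5, 0.4, 0.1]   # hypothetical cost distribution
lam, alpha = 0.5, 0.2
print(mean_cvar(costs, probs, lam, alpha), rho(costs, probs, lam, 1.0, alpha))
```

The two printed values should coincide up to floating-point error, matching the choice $\boldsymbol{\alpha} = (1, \alpha)$.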
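
Theorem 11 For any $t \ge 0$, denote by $Z = (Z_{t+1}, Z_{t+2}, \dots)$ the cost sequence from
time $t+1$ onwards. The conditional mean-CVaR risk metric under policy $\mu$ obeys the
following decomposition:

$$
\rho_{\boldsymbol{\alpha}}(Z \mid H_t, \mu) = \max_{\xi\in\mathcal{U}_{2\mathrm{CVaR}}(\boldsymbol{\alpha},P(\cdot|x_t,a_t))} \mathbb{E}\big[ S_\lambda(\xi(x_{t+1})) \cdot \rho_{\boldsymbol{\alpha}\xi(x_{t+1})}(Z \mid H_{t+1}, \mu) \mid H_t \big],
$$

where $\boldsymbol{\alpha} = (\alpha_1, \alpha_2)$ is the vector of CVaR confidence intervals. The risk envelope is
given by

$$
\mathcal{U}_{2\mathrm{CVaR}}(\boldsymbol{\alpha},P(\cdot|x_t,a_t)) = \Big\{ \xi = (\xi_1,\xi_2) : \xi_i(x_{t+1}) \in \Big[0, \frac{1}{\alpha_i}\Big],\ \sum_{x_{t+1}\in\mathcal{X}} \xi_i(x_{t+1})P(x_{t+1}|x_t,a_t) = 1,\ \forall i \in \{1,2\} \Big\},
$$

$S_\lambda : \mathbb{R}^2 \to \mathbb{R}$ is a linear operator given by $S_\lambda(\xi) = \lambda\xi_1 + (1 - \lambda)\xi_2$, and $a_t$ is the control
input induced by policy $\mu_t(h_t)$.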

Now we extend the above analysis to Bellman recursion. With the generic state space
$\mathcal{Y} = [0,1]^2$, we now define the optimal Bellman operator at any $(x,y) \in \mathcal{X}\times\mathcal{Y}$:

$$
\mathbf{T}[V](x,y) = \min_{a\in\mathcal{A}} \Big[ C(x,a) + \gamma \max_{\xi\in\mathcal{U}_{2\mathrm{CVaR}}(y,P(\cdot|x,a))} \sum_{x'\in\mathcal{X}} S_\lambda(\xi(x'))\,V(x',y\xi(x'))\,P(x'|x,a) \Big]. \tag{21}
$$

Based on the decomposition result from Theorem 11 we now have the result on the
convergence of Bellman recursion, analogous to Theorem 4, showing that the
fixed point solution of $\mathbf{T}[V](x,y) = V(x,y)$ is unique and equals the solution of
(20) with $x_0 = x$ and $y_0 = (1, \alpha)$.


Theorem 12 For any state $x \in \mathcal{X}$ and $y = (y_1, y_2) \in [0,1]^2$, the fixed point solution of
$\mathbf{T}[V](x,y) = V(x,y)$ is unique and is equal to
$V^*(x,y) := \min_{\mu\in\Pi_H} \lambda\,\mathrm{CVaR}_{y_1}\big(\lim_{T\to\infty} C_{0,T} \mid x_0, \mu\big) + (1 - \lambda)\,\mathrm{CVaR}_{y_2}\big(\lim_{T\to\infty} C_{0,T} \mid x_0, \mu\big)$.
Furthermore, let $\mu^* = \{\mu_0, \mu_1, \dots\} \in \Pi_H$ be
a policy recursively defined as in Theorem 5, with two-dimensional augmented state $\{y_j\}$ and
initial condition $y_0 = (1, \alpha)$. Then $\mu^*$ is an optimal policy for the mean-CVaR problem
with initial condition $x_0$ and CVaR confidence level $\alpha$.


Extending the interpolation-based CVaR value iteration (Algorithm 1) to this case
is straightforward, using a 2-D linear interpolation for $yV(x,y)$.